Quality control of sc/snRNA-seq

To perform quality control of single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) data, we provide the dotools_py.pp.importer_py() function, which bundles several methods for processing samples. It takes a list of H5 files generated by a mapping tool such as CellRanger or STARsolo. If CellBender has been run, the paths to the H5 files generated by CellBender can be provided as well. We also need to provide a batch name for each sample, and additional metadata can be supplied as a dictionary.
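
Since paths, batch names and metadata are matched by position, it can help to check that they line up before starting a long run. The helper below is a minimal sketch in plain Python; `check_inputs` is a hypothetical name and not part of dotools_py, and the file names are placeholders.

```python
def check_inputs(paths, ids, metadata=None):
    """Raise ValueError if the per-sample inputs do not line up.

    Hypothetical helper for illustration; not part of dotools_py.
    """
    metadata = metadata or {}
    if len(paths) != len(ids):
        raise ValueError(f"{len(paths)} paths but {len(ids)} ids")
    if len(set(ids)) != len(ids):
        raise ValueError("batch ids must be unique")
    for key, values in metadata.items():
        if len(values) != len(paths):
            raise ValueError(
                f"metadata['{key}'] has {len(values)} entries, expected {len(paths)}"
            )
    # One record per sample, combining the n-th entry of every list
    return [
        {"path": p, "id": i, **{k: v[n] for k, v in metadata.items()}}
        for n, (p, i) in enumerate(zip(paths, ids))
    ]


samples = check_inputs(
    ["healthy.h5", "disease.h5"],  # placeholder file names
    ["batch1", "batch2"],
    {"condition": ["healthy", "disease"]},
)
for sample in samples:
    print(sample)
```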

The quality control involves filtering genes and cells:

Genes: removed based on their expression levels. Genes expressed in few cells are excluded. By default, a gene is excluded if it is expressed in fewer than 5 cells.
Cells: removed based on several parameters: number of genes, mitochondrial content, doublets and number of UMI counts.
- Mitochondrial content: cells with high mitochondrial content are excluded. We recommend a maximum of 5% for scRNA-seq and 3% for snRNA-seq.
- Doublets: we implemented three different approaches for the identification of neotypic doublets (i.e., doublets formed by two or more cells of different cell types). The available implementations are scDblFinder, DoubletDetection and Scrublet.
- Number of genes: cells are removed based on the absolute number of genes detected. A lower and an upper threshold can be set.
- Number of UMI counts: cells can be removed using two approaches: absolute thresholds or quantiles. A lower and an upper threshold can be set, and both approaches can be combined (e.g., an absolute lower threshold together with an upper quantile).
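
The filters described above can be illustrated on a toy count matrix. This is a minimal NumPy sketch and not the implementation inside dotools_py: the function name `qc_filter`, the order of the steps and the toy data are assumptions, and the parameter names only mirror the arguments used later in this tutorial. Doublet detection is left out.

```python
import numpy as np


def qc_filter(counts, is_mt, min_genes=300, min_cells=5, cut_mt=5.0,
              min_counts=500, high_quantile=95):
    """Return boolean masks (cells_kept, genes_kept) for a cells x genes matrix.

    Hypothetical helper for illustration only; dotools_py may order or
    implement these steps differently.
    """
    counts = np.asarray(counts)
    # 1) Cells: absolute lower threshold on the number of detected genes
    cells = (counts > 0).sum(axis=1) >= min_genes
    # 2) Genes: keep genes detected in at least `min_cells` of the remaining cells
    genes = (counts[cells] > 0).sum(axis=0) >= min_cells
    # 3) Cells: percentage of counts coming from mitochondrial genes
    total = counts.sum(axis=1)
    pct_mt = 100 * counts[:, is_mt].sum(axis=1) / np.maximum(total, 1)
    cells &= pct_mt <= cut_mt
    # 4) Cells: absolute lower bound plus an upper quantile on UMI counts
    upper = np.percentile(total[cells], high_quantile)
    cells &= (total >= min_counts) & (total <= upper)
    return cells, genes


rng = np.random.default_rng(0)
toy = rng.poisson(1.0, size=(200, 1000))  # 200 cells x 1000 genes
mt = np.zeros(1000, dtype=bool)
mt[:20] = True  # pretend the first 20 genes are mitochondrial
cells_kept, genes_kept = qc_filter(toy, mt)
print(cells_kept.sum(), "cells and", genes_kept.sum(), "genes kept")
```

Combining an absolute lower bound with an upper quantile, as in step 4, removes near-empty droplets with a fixed cut-off while letting the upper limit adapt to the sequencing depth of each sample.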

After the per-sample quality control, the individual samples are combined into one AnnData object; the data are then log-normalised and scaled, and highly variable genes are identified. To evaluate the quality control, the distributions of total UMI counts, number of genes and mitochondrial content per cell are shown as violin plots before and after the quality control. These plots are saved in the folders containing the H5 files. Additionally, we keep track of the number of cells and genes removed in each quality control step.
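
As a reminder of what log-normalisation does, the sketch below scales every cell to the same total and applies log(1 + x). It is a NumPy illustration only; `log_normalise` is a hypothetical name, and the target of 10,000 counts per cell is a common default assumed here, not a value confirmed for dotools_py.

```python
import numpy as np


def log_normalise(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


# Two cells with the same composition but a 10-fold difference in depth
raw = np.array([[1, 3, 0], [10, 30, 0]])
norm = log_normalise(raw)
print(norm)
```

After normalisation both rows are identical, which shows that the difference in sequencing depth has been removed.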

First, we set up the environment and load the required libraries.

Environment setup#

import dotools_py as do
import pandas as pd
from IPython.display import display, SVG
import session_info

2026-03-05 17:09:39,729 - Jupyter enviroment detected. Using "inline" backend

To show how the quality control works, we use a public 10X dataset of human blood from healthy donors and donors with a malignant tumor. We download the raw and the filtered H5 files generated by 10X.

do.dt.example_10x("/Users/david/Downloads/PublicData10x")

2025-10-22 16:16:29,452 - Downloading data to /Users/david/Downloads/PublicData10x

Sequential preprocessing#

paths = [
    "/Users/david/Downloads/PublicData10x/healthy/outs/filtered_feature_bc_matrix.h5",
    "/Users/david/Downloads/PublicData10x/disease/outs/filtered_feature_bc_matrix.h5",
]

adata = do.pp.importer_py(
    paths=paths,
    ids=["batch1", "batch2"],
    metadata={"condition": ["healthy", "disease"]},  # Additional metadata information
    batch_key="batch",  # Column in obs to save batch information
    remove_doublets=True,
    doublet_tool="scDblFinder",  # Tool to identify neotypic doublets (Also available Scrublet and DoubletDetection)
    min_genes_in_cell=300,
    min_cells_with_genes=5,
    cut_mt=5,
    n_reads=10_000,
    min_counts=500,  # Filter cells with fewer than 500 UMI counts
    high_quantile=95,  # Filter cells with the top 5% high number of UMI counts
)

2026-03-05 17:09:43,146 - For snRNA a lower 'cut_mt' is recommended since mitochondrial genes
 should not be highly expressed in the nuclei
2026-03-05 17:09:43,147 - QualityControl Plots will be saved in
/Users/david/Downloads/PublicData10x/healthy/outs
reading /Users/david/Downloads/PublicData10x/healthy/outs/filtered_feature_bc_matrix.h5
 (0:00:00)
2026-03-05 17:09:43,929 - Remove cells with low number of genes
filtered out 14 cells that have less than 300 genes expressed
2026-03-05 17:09:44,012 - Remove genes lowly expressed
filtered out 16694 genes that are detected in less than 5 cells
2026-03-05 17:09:44,121 - Remove cells with high Mt-content
2026-03-05 17:09:44,138 - Remove cells based on nUMI counts
2026-03-05 17:09:57,862 - Removed 94 doublets
2026-03-05 17:09:58,182 - QualityControl Plots will be saved in
/Users/david/Downloads/PublicData10x/disease/outs
reading /Users/david/Downloads/PublicData10x/disease/outs/filtered_feature_bc_matrix.h5
 (0:00:00)
2026-03-05 17:09:58,671 - Remove cells with low number of genes
filtered out 46 cells that have less than 300 genes expressed
2026-03-05 17:09:58,753 - Remove genes lowly expressed
filtered out 16250 genes that are detected in less than 5 cells
2026-03-05 17:09:58,845 - Remove cells with high Mt-content
2026-03-05 17:09:58,850 - Remove cells based on nUMI counts
2026-03-05 17:10:01,013 - Removed 27 doublets
2026-03-05 17:10:01,200 - Concatenating samples
2026-03-05 17:10:01,280 - Normalisation of the expression
normalizing counts per cell
    finished (0:00:00)
2026-03-05 17:10:01,302 - Finding Highly Variable Genes shared across samples
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2026-03-05 17:10:01,575 - Run PCA
computing PCA
    with n_comps=50
    finished (0:00:00)

adata

AnnData object with n_obs × n_vars = 2774 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg'
    obsm: 'X_pca'
    layers: 'counts', 'logcounts'

Evaluation of the preprocessing#

We can now check the quality control plots that were generated:

files = [
    "/Users/david/Downloads/PublicData10x/healthy/outs/Vln_PreQC_batch1.svg",
    "/Users/david/Downloads/PublicData10x/healthy/outs/Vln_PostQC_batch1.svg",
    "/Users/david/Downloads/PublicData10x/healthy/outs/260305_QC_Metricsbatch1.svg"
]

display(
    SVG(files[0]),
    SVG(files[1]),
    SVG(files[2]),
)

../_images/e2b65c676cb011f6df16f85c8557778a717072528f4b12c4b33b5ee75c64567d.svg

../_images/17a5f8610ecde9ad8271763c80e3681d9e4d204726fb4399c8ea50511001fd27.svg

../_images/9f10c31d56adf8ea9ee207e68ab83b9e85b8311d9be3ce6702b49c94b8f7ae2a.svg

We can observe that the majority of cells were removed due to high mitochondrial content. Depending on the experimental set-up, we might want to increase the mitochondrial content threshold if we do not want to lose too many cells. Besides these plots, an Excel sheet keeps track of the thresholds used and of the number of cells and genes remaining after each quality control step.

table = pd.read_excel("/Users/david/Downloads/PublicData10x/healthy/outs/260305_Metrics_batch1.xlsx")
table

	QC_Step	nCells	nFeatures	Comments
0	Input_Shape	7865	33538	NaN
1	Rm_Cells_lowGenes	7851	33538	Remove cells with <300 genes
2	Rm_Genes_lowCells	7851	16844	Remove genes express in less than 5 cells
3	Rm_Cell_HighMT	2254	16844	Remove cells with >5% of Mitochondrial genes
4	Rm_Cells_nUMI_nGenes	2141	16844	Remove cells based on nUMI counts[Absolute (Mi...
5	Rm_Doublets	2047	16844	Remove neotypic doublets using scDblFinder
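
The number of cells and genes removed in each step follows from the differences between consecutive rows. The pandas sketch below rebuilds the table from the values shown above rather than reading the Excel file; the added column names are illustrative.

```python
import pandas as pd

# Values copied from the metrics table shown above (batch1)
metrics = pd.DataFrame(
    {
        "QC_Step": ["Input_Shape", "Rm_Cells_lowGenes", "Rm_Genes_lowCells",
                    "Rm_Cell_HighMT", "Rm_Cells_nUMI_nGenes", "Rm_Doublets"],
        "nCells": [7865, 7851, 7851, 2254, 2141, 2047],
        "nFeatures": [33538, 33538, 16844, 16844, 16844, 16844],
    }
)
# Difference to the previous row = number removed in that step
metrics["cells_removed"] = -metrics["nCells"].diff().fillna(0).astype(int)
metrics["genes_removed"] = -metrics["nFeatures"].diff().fillna(0).astype(int)
# Share of the input cells that is still present after each step
metrics["pct_cells_left"] = (100 * metrics["nCells"] / metrics["nCells"].iloc[0]).round(1)
print(metrics[["QC_Step", "cells_removed", "genes_removed", "pct_cells_left"]])
```

For this sample, the mitochondrial filter alone removes 5597 of the 7865 input cells (about 71%), which is why the threshold deserves a second look.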

Integration and clustering#

After the quality control, we can proceed to batch correction and integration of the samples. For this, several batch correction methods are available: Harmony, Scanorama, BBKNN, scVI or CCA from Seurat (v4 or v5 approach). After integrating the samples, we run the Leiden algorithm to find clusters and generate the UMAP embedding for visualisation.

do.tl.integrate_data(
    adata,
    batch_key="batch",
    hvg_batch=True,
    integration_method="cca5",
    resolution=0.3,  # Resolution for leiden algorithm
)

2026-03-05 17:10:37,951 - Computing HVGs
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
    with n_comps=50
    finished (0:00:00)
2026-03-05 17:10:38,652 - Integration using CCA (Seurat v5 approach)
2026-03-05 17:10:38,654 - Preprocessing to export to Seurat
2026-03-05 17:10:38,687 - Running CCA Integration
                          integratedcca_1 integratedcca_2 integratedcca_3
AAACCCAGTGCATTTG-1-batch1      -29.317470       0.3862413       -4.324833
AAACCCATCTCAACGA-1-batch1        4.571685      -1.2351645       -4.032547
AAACCCATCTCTCGAC-1-batch1        4.236896      -1.4202744       -3.751133
2026-03-05 17:11:00,514 - Loading corrected matrix
2026-03-05 17:11:00,689 - Finding neighbors
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
2026-03-05 17:11:02,938 - Run UMAP
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:02)
2026-03-05 17:11:05,779 - Clustering cells using Leiden (resolution 0.3)
running Leiden clustering
    finished: found 6 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

After the integration, we have X_CCA in obsm. This is the CCA matrix after dimensionality reduction. In contrast to the Seurat v4 approach, where the dimensions of this matrix are n_cells x n_hvg, here the dimensions are n_cells x 50.

adata.obsm["X_CCA"].shape

(2774, 50)

Evaluation of integration#

We can now visualise the integrated object and the identified clusters:

do.pl.split_embedding(adata, "batch", figsize=(8, 5))
do.pl.umap(adata, "leiden", labels="leiden", figsize=(6, 5))

../_images/6522e16d2dcab7499a6d4d803fc733d83b7814fa1df7c3998b50d8c572ebb4ce.png

../_images/8d66fabad2d545e13ce635dd05ab317e582996b87df510851b698ab531f2b635.png

adata.write("/Users/david/Downloads/PublicData10x/adata.h5ad")

Semi-automatic annotation with CellTypist#

We can also perform a semi-automatic annotation using CellTypist. In this case, we use the Healthy_COVID19_PBMC.pkl model.

do.tl.auto_annot(adata, "leiden", model="Healthy_COVID19_PBMC.pkl", convert=False, pl_cell_prob=True)

../_images/5e6d12fb9eef090060f5e5d625103c21aedefd972a9fb5a5b4f55e2d6142e8d5.png

do.pl.umap(adata, "leiden", labels="autoAnnot")

../_images/00e59cde0969fdcbbaee101ffdcdd5e3c19978b4b690f95d5aa52af28cb84a44.png

Besides the semi-automatic annotation, we should also validate the findings with known markers for these cell types.

markers = {
    "ImmuneCells": ["PTPRC"],
    "B_cells": ["CD79A", "BANK1", "MS4A1"],
    "T_cells": ["CD3E", "CD4", "IL7R"],
    "NK": ["NKG7", "KLRD1"],
    "Myeloid": ["CD68", "CD14", "ITGAM"],
    "pDC": ["LILRA4", "CLEC4C", "LRRC26"],
}

do.pl.dotplot(adata, "leiden", markers, swap_axes=False, var_group_rotation=90)

../_images/703f92d9fd29fed9b6010e275f79d17cde4195abf67f6470577654f62a8f0e2b.png

Overall, the marker expression agrees with the semi-automatic annotation, so we can assign cell-type labels to the clusters.

adata.obs["annotation"] = adata.obs.leiden.map(
    {"0": "Monocytes", "1": "T_cells", "2": "T_cells", "3": "NK", "4": "B_cells", "5": "pDC"}
)
do.pl.umap(adata, "annotation", labels="annotation")

../_images/00cd20564414972f88027035c8d63f14e9a26963b9b3a0d2fe20c4a6a13dc144.png
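
When labels are assigned with a dictionary, any cluster missing from the dictionary silently becomes NaN. The pandas sketch below shows a quick check on toy data; the cluster series and the deliberately incomplete dictionary are made up for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs["leiden"]
clusters = pd.Series(["0", "1", "2", "3", "4", "5", "1"])
cluster_to_label = {"0": "Monocytes", "1": "T_cells", "2": "T_cells",
                    "3": "NK", "4": "B_cells"}  # "5" deliberately left out

labels = clusters.map(cluster_to_label)
# Clusters present in the data but absent from the dictionary
missing = sorted(set(clusters.unique()) - set(cluster_to_label))
print("Clusters without a label:", missing)
print("Cells left unannotated:", int(labels.isna().sum()))
```

Running such a check after changing the clustering resolution helps catch clusters that were added since the dictionary was written.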

Evaluate changes in cell population#

After annotating the cell-type populations, we can evaluate whether their proportions change significantly between the healthy and disease conditions using scanpro.

do.pl.cell_composition(
    adata,
    annot_key="annotation",
    condition_key="condition",
    batch_key="batch",
    transform="arcsin",  # Produce more accurate results for simulated data
    condition_order=["healthy", "disease"],
)

[INFO] Your data doesn't have replicates! Artificial replicates will be simulated to run scanpro.
[INFO] Simulation may take some minutes...
[INFO] Generating 3 replicates and running 100 simulations...
[INFO] Finished 100 simulations in 0.86 seconds
2026-03-05 17:11:35,142 - There are 3 populations with a significant change

../_images/2e835376c72443ade4261588c086ba382d59c68979591030248cd52c8c5f97f9.png

Cell populations with a significant change are connected by dashed lines, and the p-value is indicated in the legend. In this case, we see a significant change in B cells, Monocytes and NK cells.
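
The quantities being compared are the cell-type proportions per condition. The pandas sketch below computes them on toy data and applies an arcsine square-root transform, assuming that this is what `transform="arcsin"` refers to; it does not reproduce scanpro's statistical test or its replicate simulation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs; labels are made up
obs = pd.DataFrame(
    {
        "condition": ["healthy"] * 6 + ["disease"] * 4,
        "annotation": ["T_cells", "T_cells", "T_cells", "B_cells", "NK", "NK",
                       "T_cells", "B_cells", "B_cells", "B_cells"],
    }
)
# Rows: conditions, columns: cell types, values: fraction of cells per condition
props = pd.crosstab(obs["condition"], obs["annotation"], normalize="index")
# Arcsine square-root transform, often used to stabilise the variance of proportions
props_arcsin = np.arcsin(np.sqrt(props))
print(props.round(2))
print(props_arcsin.round(2))
```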

Reclustering of a cell population#

If we are interested in specific states for a cell-type, we can also perform re-clustering. In this case, we are going to focus on the biggest cluster, the T cells.

tcell = do.tl.reclustering(
    adata,
    cluster_key="annotation",  # Metadata column with clusters
    batch_key="batch",  # Metadata column with batch information
    recluster_approach="cca5",  # Integration approach used
    use_clusters=["T_cells"],  # Cluster we want to re-cluster
    use_rep="X_CCA",  # representation to use
    get_subset=True,  # Get AnnData of T_cells re-clusters
    resolution=0.6,
)
do.pl.umap(tcell, "leiden")

2026-03-05 17:11:50,299 - Reclustering using CCA5 approach
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:01)
running Leiden clustering
    finished: found 4 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

../_images/16ba320a1a47d775182b8e8a846b492ead9b67fa17995b6bab6bc83d62266b22.png

We identified 4 clusters. To evaluate whether they correspond to T cell subtypes, we can identify the top markers for each cluster.

do.tl.rank_genes_groups(tcell, groupby="leiden", method="wilcoxon", tie_correct=True, pts=True)
table = do.get.dge_results(tcell)
table_filt = table[(table.log2fc > 0.25) & (table.padj < 0.05)]

for group in table_filt.group.unique():
    display(table_filt[table_filt.group == group].head(6))

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:05)

	group	GeneName	statistic	log2fc	pvals	padj	pts_group	pts_ref
0	0	RPS3A	24.220013	0.578721	1.369331e-129	8.451970e-126	0.998967	1.000000
1	0	RPS13	23.757196	0.637562	9.258045e-125	4.285781e-121	0.998967	0.998788
2	0	RPL30	22.125813	0.443310	1.783986e-108	6.606813e-105	1.000000	1.000000
3	0	RPS23	21.858437	0.484084	6.461934e-106	1.994260e-102	1.000000	0.998788
4	0	RPL32	21.758680	0.449734	5.717034e-105	1.512319e-101	1.000000	1.000000
5	0	RPS27A	21.066700	0.436605	1.607524e-98	3.720816e-95	1.000000	0.998788

	group	GeneName	statistic	log2fc	pvals	padj	pts_group	pts_ref
18517	1	LINC02446	35.053997	7.035328	3.388997e-269	6.275405e-265	0.813333	0.005821
18518	1	CD8B	32.606480	6.153499	3.319344e-233	3.073215e-229	0.906667	0.020373
18519	1	CD8A	29.618280	5.450690	8.691222e-193	5.364512e-189	0.760000	0.016298
18520	1	CD8B2	14.531270	6.848237	7.678328e-48	3.554490e-44	0.146667	0.001164
18521	1	CTSW	13.804950	2.810720	2.379395e-43	8.811850e-40	0.666667	0.119907
18522	1	S100B	10.849152	5.708410	2.012853e-27	6.193968e-24	0.120000	0.003492

	group	GeneName	statistic	log2fc	pvals	padj	pts_group	pts_ref
37034	2	ANXA1	21.476015	2.394850	2.609595e-102	4.832187e-98	0.748252	0.250614
37035	2	S100A4	20.369370	1.968216	3.126732e-92	2.894885e-88	0.865385	0.505324
37036	2	B2M	19.312704	0.441488	4.199950e-83	2.592349e-79	1.000000	1.000000
37037	2	S100A11	19.250797	1.834325	1.390048e-82	6.434877e-79	0.723776	0.284193
37038	2	ITGB1	18.959566	2.128073	3.681748e-80	1.363499e-76	0.596154	0.158067
37039	2	ANXA2	17.910473	2.517298	9.770188e-72	2.261432e-68	0.423077	0.066339

	group	GeneName	statistic	log2fc	pvals	padj	pts_group	pts_ref
55551	3	IKZF2	20.558655	3.792305	6.439410e-94	1.192386e-89	0.477528	0.037152
55552	3	RTKN2	18.475254	3.668481	3.266805e-76	3.024572e-72	0.455056	0.047059
55553	3	TIGIT	17.810202	3.052925	5.889649e-71	3.635287e-67	0.533708	0.076780
55554	3	CTLA4	17.165768	2.865431	4.791085e-66	2.217913e-62	0.528090	0.081734
55555	3	FOXP3	16.695776	4.500996	1.406768e-62	5.209826e-59	0.280899	0.016099
55556	3	TNFRSF9	15.840305	6.292408	1.640226e-56	4.338867e-53	0.191011	0.004334
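
The filtering and per-group display can also be written as a single pandas expression, which makes it easy to rank by effect size instead of by test statistic. The table below is a toy stand-in with made-up values that only reuses the column names shown above.

```python
import pandas as pd

# Toy stand-in for the table returned by do.get.dge_results(); values are made up
toy = pd.DataFrame(
    {
        "group": ["0", "0", "0", "1", "1", "1"],
        "GeneName": ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"],
        "log2fc": [0.6, 0.1, 1.2, 6.1, 5.4, 0.3],
        "padj": [1e-10, 1e-3, 0.2, 1e-50, 1e-30, 0.04],
    }
)
significant = toy[(toy["log2fc"] > 0.25) & (toy["padj"] < 0.05)]
# Sort within each group by effect size and keep the two strongest markers
top = (
    significant.sort_values(["group", "log2fc"], ascending=[True, False])
    .groupby("group", sort=False)
    .head(2)
)
print(top)
```

Ranking by log2fc tends to favour specific markers over broadly expressed genes with small but highly significant shifts, such as the ribosomal genes listed for cluster 0.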

From the list of markers, cluster 3 seems to express markers of regulatory T cells, while cluster 1 seems to be enriched in cytotoxic markers. We can visualise the distribution of these genes.

do.pl.umap(tcell, ["FOXP3", "CTLA4", "CD8A", "GZMK"], ncols=2, labels="leiden")

../_images/1012ff56079b6853a38b218d13fe75f82858044c6e8e994cbec1f0c85deb2400.png

From these plots, we can see that cluster 1 is enriched for cytotoxic markers. We can transfer this annotation to our original object and re-evaluate the changes in cell populations.

tcell.obs["annotation_recluster"] = tcell.obs.leiden.map(
    {"0": "T_cells", "1": "T_cytotoxic", "2": "T_cells", "3": "Tregs", "4": "T_cells"}
)
adata.obs["annotation_recluster"] = adata.obs["annotation"].copy()
do.utility.transfer_labels(
    adata_original=adata,
    adata_subset=tcell,
    original_key="annotation_recluster",
    subset_key="annotation_recluster",
    original_labels=["T_cells"],
)
do.pl.umap(adata, "annotation_recluster", labels="annotation_recluster")

../_images/a862124f941f1bd2ca2883a1be4d5d6fde158d70cc51f166cd0c0f7dd150db6a.png
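
Conceptually, transferring labels means overwriting the coarse label of every cell in the subset, matched by barcode. The pandas sketch below shows this idea on toy tables; it is an illustration under that assumption and not the code of do.utility.transfer_labels().

```python
import pandas as pd

# Toy stand-ins for adata.obs and tcell.obs, indexed by cell barcode
full = pd.DataFrame(
    {"annotation": ["T_cells", "T_cells", "NK", "T_cells"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
subset = pd.DataFrame(
    {"annotation_recluster": ["Tregs", "T_cytotoxic", "T_cells"]},
    index=["cell1", "cell2", "cell4"],
)

# Start from the coarse labels as plain strings, so new label names can be assigned
full["annotation_recluster"] = full["annotation"].astype(str)
# Overwrite only the cells present in the subset; pandas aligns on the index
full.loc[subset.index, "annotation_recluster"] = subset["annotation_recluster"]
print(full)
```

Converting to strings first matters when the original column is categorical, because assigning a label that is not among the existing categories would otherwise raise an error.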

do.pl.cell_composition(
    adata,
    annot_key="annotation_recluster",
    condition_key="condition",
    batch_key="batch",
    transform="arcsin",
    condition_order=["healthy", "disease"],
)

[INFO] Your data doesn't have replicates! Artificial replicates will be simulated to run scanpro.
[INFO] Simulation may take some minutes...
[INFO] Generating 3 replicates and running 100 simulations...
[INFO] Finished 100 simulations in 1.07 seconds
2026-03-05 17:12:11,441 - There are 5 populations with a significant change

../_images/bf8e0e448bafe561fafba93f055353108e19d806d08e90b019bd433820b92aee.png

We can see that even though there is a decrease in the proportion of T_cytotoxic, the change is not significant. On the other hand, the regulatory T cells increase significantly.

Gene Ontology analysis#

We can also evaluate which biological processes are enriched in a cell type in each condition by performing gene ontology analysis. First, we need to identify differentially expressed genes; here we focus on T cells. We can then use do.tl.go_analysis() to run gene set analysis using the Enrichr API. This function splits the differentially expressed genes into up- and down-regulated sets and runs the analysis for each set.

tcell = adata[adata.obs.annotation == "T_cells"].copy()  # copy so results are not written to a view
do.tl.rank_genes_groups(
    tcell, groupby="condition", method="wilcoxon", tie_correct=True, pts=True, reference="healthy", groups=["disease"]
)
table = do.get.dge_results(tcell)
df = do.tl.go_analysis(
    table,
    gene_key="GeneName",
    pval_key="padj",
    log2fc_key="log2fc",
    log2fc_cutoff=0.25,  # It will take -0.25 and +0.25
    specie="Human",
    go_catgs=["GO_Biological_Process_2023"],
)
df.head(10)

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
2026-03-05 17:12:40,599 - Running GSA on Up- and Down-regulated genes

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes	state
0	GO_Biological_Process_2023	Regulation Of Apoptotic Process (GO:0042981)	116/705	1.028501e-11	4.284737e-08	2.147309	54.327632	TFRC;ARL6IP1;CIB1;FAIM2;TNF;IKZF3;CCND2;EPC1;P...	enriched
1	GO_Biological_Process_2023	Positive Regulation Of Cytokine Production (GO...	66/320	2.493362e-11	5.193672e-08	2.800421	68.371737	IL21;ITK;HILPDA;RORA;PTPN22;TNF;IFIH1;PNP;PDE4...	enriched
2	GO_Biological_Process_2023	Regulation Of Gene Expression (GO:0010468)	161/1127	1.145223e-10	1.590333e-07	1.829209	41.871053	ZNF331;TFRC;NAB1;NAB2;JMJD1C;RORA;PRDM2;AHR;NR...	enriched
3	GO_Biological_Process_2023	Positive Regulation Of Apoptotic Process (GO:0...	57/270	2.220200e-10	2.312339e-07	2.875168	63.909956	TOP2A;PRR7;BTG1;CTSV;TNF;ADAMTSL4;CTSL;CASP3;P...	enriched
4	GO_Biological_Process_2023	Regulation Of Cytokine Production (GO:0001817)	40/167	2.438576e-09	2.031822e-06	3.366014	66.754292	IL21;PTGER4;ITK;HILPDA;ZBTB1;TNF;HDAC9;IFIH1;I...	enriched
5	GO_Biological_Process_2023	Regulation Of DNA-templated Transcription (GO:...	239/1922	3.152261e-09	2.188720e-06	1.571761	30.767450	ZNF296;JMJD1C;IKZF2;IKZF3;BACH1;IKZF5;SPIB;GPB...	enriched
6	GO_Biological_Process_2023	Regulation Of B Cell Proliferation (GO:0030888)	18/44	8.281962e-09	4.312832e-06	7.344744	136.679712	IL21;IL10;LYN;VAV3;CD74;MEF2C;TFRC;TNFRSF13B;F...	enriched
7	GO_Biological_Process_2023	Response To Unfolded Protein (GO:0006986)	18/44	8.281962e-09	4.312832e-06	7.344744	136.679712	HSPA8;PTPN1;HSP90AA1;HSP90AB1;HSPA4;RHBDD1;HSP...	enriched
8	GO_Biological_Process_2023	Negative Regulation Of Apoptotic Process (GO:0...	80/482	1.196966e-08	5.540624e-06	2.145098	39.128503	ARF4;TFRC;ARL6IP1;CITED2;CIB1;FAIM2;MTRNR2L8;T...	enriched
9	GO_Biological_Process_2023	Negative Regulation Of DNA-templated Transcrip...	140/1025	3.648046e-08	1.474492e-05	1.721391	29.481392	NAB1;GMNN;NAB2;ZBTB21;UBE2D1;HNRNPU;PRDM2;ARID...	enriched
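
The split into up- and down-regulated genes that precedes the enrichment can be sketched with pandas. The table is a toy stand-in with made-up values, and the significance threshold of 0.05 is an assumption; the defaults of do.tl.go_analysis() may differ.

```python
import pandas as pd

# Toy stand-in for a differential expression table; values are made up
toy = pd.DataFrame(
    {
        "GeneName": ["g1", "g2", "g3", "g4", "g5"],
        "log2fc": [1.5, -0.8, 0.1, -0.3, 2.0],
        "padj": [0.001, 0.01, 0.0001, 0.2, 0.03],
    }
)
cutoff, alpha = 0.25, 0.05  # alpha is an assumed significance threshold
sig = toy[toy["padj"] < alpha]
# Symmetric cut-off: above +cutoff is up-regulated, below -cutoff is down-regulated
up = sig.loc[sig["log2fc"] > cutoff, "GeneName"].tolist()
down = sig.loc[sig["log2fc"] < -cutoff, "GeneName"].tolist()
print("up:", up, "down:", down)
```

Each gene list would then be submitted separately, so that a term is reported as enriched among the up-regulated or among the down-regulated genes.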

We can visualise the top terms enriched in each condition with do.pl.split_bar_gsea(). The function assumes that its input contains only significant terms, so we first filter on the adjusted p-value.

df_filt = df[df["Adjusted P-value"] < 0.05]
do.pl.split_bar_gsea(
    df_filt,
    term_col="Term",
    col_split="Combined Score",  # Column to use for the x-axis
    cond_col="state",  # Column that splits the up and down-regulated terms
    pos_cond="enriched",  # value in cond_col that should be in the positive axis
)

2026-03-05 17:13:00,146 - !!! Assuming GO Terms are preprocessed (Only Significant terms included)

../_images/0e4283ddf5c007152bb1209fee5d0badd67c018eb9e07b875f96e153b8ff5fde.png
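
A split bar plot of this kind places one condition on the positive axis and mirrors the other onto the negative axis. The pandas sketch below shows that preparation step on toy data; the terms, the scores and the "depleted" label are hypothetical, and the actual internals of do.pl.split_bar_gsea() may differ.

```python
import pandas as pd

# Toy stand-in for the enrichment table; "depleted" is a hypothetical label
terms = pd.DataFrame(
    {
        "Term": ["term A", "term B", "term C", "term D"],
        "Combined Score": [68.4, 54.3, 40.2, 12.5],
        "Adjusted P-value": [5e-8, 4e-8, 0.01, 0.3],
        "state": ["enriched", "enriched", "depleted", "depleted"],
    }
)
sig = terms[terms["Adjusted P-value"] < 0.05].copy()
# Terms of the positive condition keep their sign, all others are mirrored
sign = sig["state"].eq("enriched").map({True: 1, False: -1})
sig["signed_score"] = sig["Combined Score"] * sign
print(sig.sort_values("signed_score", ascending=False))
```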

adata.write("/Users/david/Downloads/PublicData10x/adata.h5ad")

session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)

-----
anndata             0.11.4
dotools_py          0.0.1
pandas              2.3.2
session_info        v1.0.1
-----

Cython                      3.1.4
PIL                         11.3.0
adjustText                  1.3.0
appnope                     0.1.4
argparse                    1.1
array_api_compat            1.12.0
arrow                       1.3.0
attr                        25.3.0
attrs                       25.3.0
babel                       2.17.0
beartype                    0.22.8
celltypist                  1.7.1
certifi                     2025.08.03
cffi                        2.0.0
charset_normalizer          3.4.3
cloudpickle                 3.1.1
comm                        0.2.3
coverage                    7.11.0
csv                         1.0
ctypes                      1.1.0
cycler                      0.12.1
cython                      3.1.4
dask                        2024.11.2
dateutil                    2.9.0.post0
debugpy                     1.8.17
decimal                     1.70
decorator                   5.2.1
defusedxml                  0.7.1
deprecated                  1.2.18
et_xmlfile                  2.0.0
executing                   2.2.1
fsspec                      2025.9.0
geopandas                   1.1.1
gseapy                      1.1.10
h5py                        3.14.0
idna                        3.10
igraph                      0.11.9
ipaddress                   1.0
ipykernel                   6.30.1
ipywidgets                  8.1.7
jedi                        0.19.2
jinja2                      3.1.6
joblib                      1.5.2
json                        2.0.9
json5                       0.12.1
jsonpointer                 3.0.0
jsonschema                  4.25.1
jupyter_events              0.12.0
jupyter_server              2.17.0
jupyterlab_server           2.27.3
kiwisolver                  1.4.9
lark                        1.2.2
leidenalg                   0.10.2
llvmlite                    0.45.0
logging                     0.5.1.2
markupsafe                  3.0.2
marshal                     4
matplotlib                  3.10.6
matplotlib_inline           0.1.7
msgpack                     1.1.2
natsort                     8.4.0
nbformat                    5.10.4
numba                       0.62.0
numcodecs                   0.15.1
numpy                       2.3.3
openpyxl                    3.1.5
packaging                   25.0
parso                       0.8.5
patsy                       1.0.1
platform                    1.0.8
platformdirs                4.4.0
polars                      1.33.1
prompt_toolkit              3.0.52
psutil                      7.1.0
pure_eval                   0.2.3
pyarrow                     21.0.0
pycparser                   2.23
pydot                       4.0.1
pygments                    2.19.2
pynndescent                 0.5.13
pyparsing                   3.2.4
pyproj                      3.7.2
pytz                        2025.2
re                          2.2.1
requests                    2.32.5
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
scanpro                     0.4.0
scanpy                      1.11.4
scipy                       1.15.3
seaborn                     0.13.2
shapely                     2.1.2
simplejson                  3.20.2
six                         1.17.0
sklearn                     1.7.2
sniffio                     1.3.1
socketserver                0.4
sparse                      0.17.0
sqlite3                     2.6.0
stack_data                  0.6.3
statsmodels                 0.14.5
stdlib_list                 0.11.1
sys                         3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
tarfile                     0.9.0
texttable                   1.7.0
threadpoolctl               3.6.0
tlz                         1.0.0
toolz                       1.0.0
torch                       2.8.0
tornado                     6.5.2
tqdm                        4.67.1
traitlets                   5.14.3
umap                        0.5.9.post2
urllib3                     2.5.0
wcwidth                     0.2.13
websocket                   1.8.0
wrapt                       1.17.3
yaml                        6.0.2
zarr                        2.18.7
zlib                        1.0
zmq                         27.1.0

-----
IPython             9.5.0
jupyter_client      8.6.3
jupyter_core        5.8.1
jupyterlab          4.4.7
notebook            7.4.5
-----
Python 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
macOS-26.3-arm64-arm-64bit
10 logical CPU cores, arm
-----
Session information updated at 2026-03-05 17:13