Quality control of Visium#
To perform quality control of visium we have the dotools_py.pp.importer_py() function. This function compiles different methods to process samples. We need to define a list of H5 files that have been generated from a mapping tool like SpaceRanger. Additionally, we need to provide the batch names for the samples, and additional metadata can be provided in the form of a dictionary.
The quality control involves filtering genes and cells:
Genes: are removed based on their expression levels. Genes expressed in low amount of cells are excluded. By default, we consider that a gene is excluded if it is expressed in less than 5 cells
Spots: are removed based on different parameters, including: number of genes, and number of UMI counts. For Visium, a permissive quality control is recommended, since the more strick we are the more information we lose
Number of genes: cells are removed by absolute number of genes. A lower and upper threshold can be set.
Number of UMI counts: cells can be removed using two approaches: absolute or quantiles. A lower and upper threshold can be set and a combination of both approaches can be used (e.g., an absolute lower threshold and filter cells on the upper quantile).
Contrary to sc/snRNA-seq we do not filter spots based on mitochondrial content, since these do not represent low quality cells in this platform. We might want to remove the genes for downstream analysis depending on the set-up. Additionally, doublets cannot be identified since each spot represent a region with more than 1 cell. Lastly, it is recommended to identify regions with high hemoglobin content, since it might represent contamination.
After the quality control, the individual samples will be combined into one AnnData object and log-normalization, scaling and highly variable genes will be calculated. To evaluate the quality control the distribution of total UMI, number of genes, mitochondrial genes and hemoglobin genes per cell will be plotted in a violin plot after the quality control. These plots will be saved in the folders containing the H5 files. Additionally, we also keep track on the number of spots and genes that have been removed in each quality control step.
First, we start setting up the environment and loading the required libraries
Environment setup#
import dotools_py as do
import pandas as pd
from IPython.display import display, SVG
import session_info
2026-03-06 09:53:33,821 - Jupyter enviroment detected. Using "inline" backend
To show how the quality control works, we are going to use one public dataset from 10X from the human heart.
do.dt.example_visium("/Users/david/Downloads/PublicVisium10x")
2026-03-03 13:41:48,639 - Downloading data to /Users/david/Downloads/PublicVisium10x
paths = [
"/Users/david/Downloads/PublicVisium10x/",
]
adata = do.pp.importer_py(
paths=paths,
ids=["batch1"],
metadata={"condition": ["healthy"], "tissue": ["heart"]}, # Additional metadata information
batch_key="batch", # Column in obs to save batch information
remove_doublets=False,
min_genes_in_cell=200,
min_cells_with_genes=5,
cut_mt=None,
n_reads=10_000,
min_counts=500, # Filter cells with less than 500 genes
max_counts=100_000,
normalisation_method="LogNormalisation",
metrics_patterns=["mt-", ("rps", "rpl"), "^hb.*-"],
metrics_names=["mt", "ribo", "hb"],
random_state=0,
technology="visium"
)
2026-03-06 09:53:33,839 - For Visium, removing doublets is not recommended, ignoring this step
2026-03-06 09:53:33,839 - For Visium, identifying hemoglobin genes is recommended, adding the pattern
2026-03-06 09:53:33,840 - QualityControl Plots will be saved in
/Users/david/Downloads/PublicVisium10x
reading /Users/david/Downloads/PublicVisium10x/filtered_feature_bc_matrix.h5
(0:00:00)
2026-03-06 09:53:34,437 - Remove cells with low number of genes
filtered out 11 cells that have less than 200 genes expressed
2026-03-06 09:53:34,492 - Remove genes lowly expressed
filtered out 19885 genes that are detected in less than 5 cells
2026-03-06 09:53:34,561 - Remove cells based on nUMI counts
filtered out 1 cells that have less than 500 counts
2026-03-06 09:53:35,130 - Concatenating samples
2026-03-06 09:53:35,150 - Normalisation of the expression
normalizing counts per cell
finished (0:00:00)
2026-03-06 09:53:35,192 - Finding Highly Variable Genes shared across samples
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
2026-03-06 09:53:35,289 - Run PCA
computing PCA
with n_comps=50
finished (0:00:00)
adata
AnnData object with n_obs × n_vars = 4235 × 16716
obs: 'in_tissue', 'array_row', 'array_col', 'batch', 'condition', 'tissue', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'n_genes', 'n_counts'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'spatial', 'log1p', 'hvg'
obsm: 'spatial', 'X_pca'
layers: 'counts', 'logcounts'
Evaluation of the preprocessing#
We can now check the quality control plots that were generated:
files = [
"/Users/david/Downloads/PublicVisium10x/260303_QC_Metricsbatch1.svg",
"/Users/david/Downloads/PublicVisium10x/Vln_PostQC_batch1.svg",
]
display(
SVG(files[0]),
SVG(files[1]),
)
We can observe that we have not remove almost any of spots. Besides these plots, we also have an ExcelSheet that kept track on the thresholds used during the quality control.
table = pd.read_excel("/Users/david/Downloads/PublicVisium10x/260303_Metrics_batch1.xlsx")
table
| QC_Step | nCells | nFeatures | Comments | |
|---|---|---|---|---|
| 0 | Input_Shape | 4247 | 36601 | NaN |
| 1 | Rm_Cells_lowGenes | 4236 | 36601 | Remove cells with <200 genes |
| 2 | Rm_Genes_lowCells | 4236 | 16716 | Remove genes express in less than 5 cells |
| 3 | Rm_Cells_nUMI_nGenes | 4235 | 16716 | Remove cells based on nUMI counts[Absolute (Mi... |
Since this a spatial transcriptomics dataset, we can also visualize the distribution of these quality control metrics.
Integration and clustering#
After the quality control, we can now proceed to the batch correction and integration of the samples. In this example, we only have 1 slide, so we do not need to perform batch correction and can run PCA directly. Following dimensionality reduction, we can identify niches.
do.tl.integrate_data(
adata,
batch_key="batch",
integration_method="pca",
resolution=0.75, # Resolution for leiden algorithm
technology="spatial",
)
2026-03-06 09:53:50,022 - Computing HVGs
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
computing PCA
with n_comps=50
finished (0:00:00)
2026-03-06 09:53:50,883 - Finding neighbors
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
Creating graph using `grid` coordinates and `None` transform and `1` libraries.
Adding `adata.obsp['spatial_connectivities']`
`adata.obsp['spatial_distances']`
`adata.uns['spatial_neighbors']`
Finish (0:00:00)
2026-03-06 09:54:00,461 - Run UMAP
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:03)
2026-03-06 09:54:04,249 - Clustering cells using Leiden (resolution 0.75)
Niche exploration#
We can now visualize the clusters:
do.pl.umap(adata, "leiden", labels="leiden", figsize=(6, 5))
do.pl.slides(adata, "leiden", figsize=(6, 5))
adata.write("/Users/david/Downloads/PublicVisium10x/adata_visium.h5ad")
session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)
Click to view session information
----- anndata 0.11.4 dotools_py 0.0.1 pandas 2.3.2 session_info v1.0.1 -----
Click to view modules imported as dependencies
Cython 3.1.4 PIL 11.3.0 adjustText 1.3.0 appnope 0.1.4 argparse 1.1 arrow 1.3.0 attr 25.3.0 attrs 25.3.0 babel 2.17.0 beartype 0.22.8 certifi 2025.08.03 cffi 2.0.0 charset_normalizer 3.4.3 cloudpickle 3.1.1 comm 0.2.3 coverage 7.11.0 csv 1.0 ctypes 1.1.0 cycler 0.12.1 cython 3.1.4 dask 2024.11.2 datashader 0.18.2 dateutil 2.9.0.post0 debugpy 1.8.17 decimal 1.70 decorator 5.2.1 defusedxml 0.7.1 deprecated 1.2.18 docrep 0.3.2 et_xmlfile 2.0.0 executing 2.2.1 fsspec 2025.9.0 geopandas 1.1.1 h5py 3.14.0 idna 3.10 igraph 0.11.9 imageio 2.37.0 ipaddress 1.0 ipykernel 6.30.1 ipywidgets 8.1.7 jedi 0.19.2 jinja2 3.1.6 joblib 1.5.2 json 2.0.9 json5 0.12.1 jsonpointer 3.0.0 jsonschema 4.25.1 jupyter_events 0.12.0 jupyter_server 2.17.0 jupyterlab_server 2.27.3 kiwisolver 1.4.9 lark 1.2.2 lazy_loader 0.4 leidenalg 0.10.2 llvmlite 0.45.0 logging 0.5.1.2 markupsafe 3.0.2 marshal 4 matplotlib 3.10.6 matplotlib_inline 0.1.7 matplotlib_scalebar 0.9.0 more_itertools 10.8.0 msgpack 1.1.2 multipledispatch 0.6.0 multiscale_spatial_image 2.0.3 natsort 8.4.0 nbformat 5.10.4 networkx 3.5 numba 0.62.0 numcodecs 0.15.1 numpy 2.3.3 openpyxl 3.1.5 packaging 25.0 parso 0.8.5 patsy 1.0.1 platform 1.0.8 platformdirs 4.4.0 polars 1.33.1 prompt_toolkit 3.0.52 psutil 7.1.0 pure_eval 0.2.3 pyarrow 21.0.0 pycparser 2.23 pyct 0.6.0 pydot 4.0.1 pygments 2.19.2 pynndescent 0.5.13 pyparsing 3.2.4 pyproj 3.7.2 pytz 2025.2 re 2.2.1 requests 2.32.5 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 scanpy 1.11.4 scipy 1.15.3 seaborn 0.13.2 shapely 2.1.2 simplejson 3.20.2 six 1.17.0 skimage 0.25.2 sklearn 1.7.2 sniffio 1.3.1 socketserver 0.4 sparse 0.17.0 spatial_image 1.2.3 spatialdata 0.5.0 spatialleiden 0.4.0 sqlite3 2.6.0 squidpy 1.6.5 stack_data 0.6.3 statsmodels 0.14.5 stdlib_list 0.11.1 sys 3.11.13 (main, Jun 5 2025, 08:21:08) [Clang 14.0.6 ] tarfile 0.9.0 texttable 1.7.0 threadpoolctl 3.6.0 tifffile 2025.9.20 tlz 1.0.0 toolz 1.0.0 torch 2.8.0 tornado 6.5.2 tqdm 4.67.1 traitlets 5.14.3 umap 0.5.9.post2 urllib3 2.5.0 validators 0.35.0 wcwidth 0.2.13 websocket 1.8.0 wrapt 1.17.3 xarray 2025.9.0 xarray_dataclass 3.0.0 xarray_schema 0.0.3 xrspatial 0.4.0 yaml 6.0.2 zarr 2.18.7 zlib 1.0 zmq 27.1.0
----- IPython 9.5.0 jupyter_client 8.6.3 jupyter_core 5.8.1 jupyterlab 4.4.7 notebook 7.4.5 ----- Python 3.11.13 (main, Jun 5 2025, 08:21:08) [Clang 14.0.6 ] macOS-26.3-arm64-arm-64bit 10 logical CPU cores, arm ----- Session information updated at 2026-03-06 10:44