Quality control of Visium

Quality control of Visium#

To perform quality control of visium we have the dotools_py.pp.importer_py() function. This function compiles different methods to process samples. We need to define a list of H5 files that have been generated from a mapping tool like SpaceRanger. Additionally, we need to provide the batch names for the samples, and additional metadata can be provided in the form of a dictionary.

The quality control involves filtering genes and cells:

Genes: are removed based on their expression levels. Genes expressed in low amount of cells are excluded. By default, we consider that a gene is excluded if it is expressed in less than 5 cells
Spots: are removed based on different parameters, including: number of genes, and number of UMI counts. For Visium, a permissive quality control is recommended, since the more strick we are the more information we lose
- Number of genes: cells are removed by absolute number of genes. A lower and upper threshold can be set.
- Number of UMI counts: cells can be removed using two approaches: absolute or quantiles. A lower and upper threshold can be set and a combination of both approaches can be used (e.g., an absolute lower threshold and filter cells on the upper quantile).

Contrary to sc/snRNA-seq we do not filter spots based on mitochondrial content, since these do not represent low quality cells in this platform. We might want to remove the genes for downstream analysis depending on the set-up. Additionally, doublets cannot be identified since each spot represent a region with more than 1 cell. Lastly, it is recommended to identify regions with high hemoglobin content, since it might represent contamination.

After the quality control, the individual samples will be combined into one AnnData object and log-normalization, scaling and highly variable genes will be calculated. To evaluate the quality control the distribution of total UMI, number of genes, mitochondrial genes and hemoglobin genes per cell will be plotted in a violin plot after the quality control. These plots will be saved in the folders containing the H5 files. Additionally, we also keep track on the number of spots and genes that have been removed in each quality control step.

First, we start setting up the environment and loading the required libraries

Environment setup#

import dotools_py as do
import pandas as pd
from IPython.display import display, SVG
import session_info

2026-03-06 09:53:33,821 - Jupyter enviroment detected. Using "inline" backend

To show how the quality control works, we are going to use one public dataset from 10X from the human heart.

do.dt.example_visium("/Users/david/Downloads/PublicVisium10x")

2026-03-03 13:41:48,639 - Downloading data to /Users/david/Downloads/PublicVisium10x

paths = [
    "/Users/david/Downloads/PublicVisium10x/",
]

adata = do.pp.importer_py(
    paths=paths,
    ids=["batch1"],
    metadata={"condition": ["healthy"], "tissue": ["heart"]},  # Additional metadata information
    batch_key="batch",  # Column in obs to save batch information
    remove_doublets=False,
    min_genes_in_cell=200,
    min_cells_with_genes=5,
    cut_mt=None,
    n_reads=10_000,
    min_counts=500,  # Filter cells with less than 500 genes
    max_counts=100_000,
    normalisation_method="LogNormalisation",
    metrics_patterns=["mt-", ("rps", "rpl"), "^hb.*-"],
    metrics_names=["mt", "ribo", "hb"],
    random_state=0,
    technology="visium"
)

2026-03-06 09:53:33,839 - For Visium, removing doublets is not recommended, ignoring this step
2026-03-06 09:53:33,839 - For Visium, identifying hemoglobin genes is recommended, adding the pattern
2026-03-06 09:53:33,840 - QualityControl Plots will be saved in
/Users/david/Downloads/PublicVisium10x
reading /Users/david/Downloads/PublicVisium10x/filtered_feature_bc_matrix.h5
 (0:00:00)
2026-03-06 09:53:34,437 - Remove cells with low number of genes
filtered out 11 cells that have less than 200 genes expressed
2026-03-06 09:53:34,492 - Remove genes lowly expressed
filtered out 19885 genes that are detected in less than 5 cells
2026-03-06 09:53:34,561 - Remove cells based on nUMI counts
filtered out 1 cells that have less than 500 counts
2026-03-06 09:53:35,130 - Concatenating samples
2026-03-06 09:53:35,150 - Normalisation of the expression
normalizing counts per cell
    finished (0:00:00)
2026-03-06 09:53:35,192 - Finding Highly Variable Genes shared across samples
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2026-03-06 09:53:35,289 - Run PCA
computing PCA
    with n_comps=50
    finished (0:00:00)

adata

AnnData object with n_obs × n_vars = 4235 × 16716
    obs: 'in_tissue', 'array_row', 'array_col', 'batch', 'condition', 'tissue', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'n_genes', 'n_counts'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'spatial', 'log1p', 'hvg'
    obsm: 'spatial', 'X_pca'
    layers: 'counts', 'logcounts'

Evaluation of the preprocessing#

We can now check the quality control plots that were generated:

files = [
    "/Users/david/Downloads/PublicVisium10x/260303_QC_Metricsbatch1.svg",
    "/Users/david/Downloads/PublicVisium10x/Vln_PostQC_batch1.svg",
]
display(
    SVG(files[0]),
    SVG(files[1]),
)

../_images/73dee2d487fef9b849ef8cd8984a55bb5a08096a3dd9d991ddae385e2c8b06c3.svg

../_images/68f0449351f77d0dd0b9570e9ed4a99a0db3eb45cb7d10be8bb04888ef53fc42.svg

We can observe that we have not remove almost any of spots. Besides these plots, we also have an ExcelSheet that kept track on the thresholds used during the quality control.

table = pd.read_excel("/Users/david/Downloads/PublicVisium10x/260303_Metrics_batch1.xlsx")
table

	QC_Step	nCells	nFeatures	Comments
0	Input_Shape	4247	36601	NaN
1	Rm_Cells_lowGenes	4236	36601	Remove cells with <200 genes
2	Rm_Genes_lowCells	4236	16716	Remove genes express in less than 5 cells
3	Rm_Cells_nUMI_nGenes	4235	16716	Remove cells based on nUMI counts[Absolute (Mi...

Since this a spatial transcriptomics dataset, we can also visualize the distribution of these quality control metrics.

do.pl.slides(adata, ["total_counts", "n_genes_by_counts", "pct_counts_mt"], figsize=(12, 3), cmap="RdYlBu_r")

../_images/b84ca7fcd2a27f369ba8d5c702078cebcf2965a251a30f04f1e9799dd4737665.png

Integration and clustering#

After the quality control, we can now proceed to the batch correction and integration of the samples. In this example, we only have 1 slide, so we do not need to perform batch correction and can run PCA directly. Following dimensionality reduction, we can identify niches.

do.tl.integrate_data(
    adata,
    batch_key="batch",
    integration_method="pca",
    resolution=0.75,  # Resolution for leiden algorithm
    technology="spatial",
)

2026-03-06 09:53:50,022 - Computing HVGs
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
    with n_comps=50
    finished (0:00:00)
2026-03-06 09:53:50,883 - Finding neighbors
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
Creating graph using `grid` coordinates and `None` transform and `1` libraries.
Adding `adata.obsp['spatial_connectivities']`
       `adata.obsp['spatial_distances']`
       `adata.uns['spatial_neighbors']`
Finish (0:00:00)
2026-03-06 09:54:00,461 - Run UMAP
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:03)
2026-03-06 09:54:04,249 - Clustering cells using Leiden (resolution 0.75)

Niche exploration#

We can now visualize the clusters:

do.pl.umap(adata, "leiden", labels="leiden", figsize=(6, 5))
do.pl.slides(adata, "leiden", figsize=(6, 5))

../_images/7679f4e121438d2ca631e9f22b1dc68bf40179a711cc926c2f6e4618a8fdad18.png

<Figure size 320x400 with 0 Axes>

../_images/8740cb812d98c1f1512ba9870f70413a2d803520cef10491feb2a48b4baa5c46.png

adata.write("/Users/david/Downloads/PublicVisium10x/adata_visium.h5ad")

session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)

Click to view session information

-----
anndata             0.11.4
dotools_py          0.0.1
pandas              2.3.2
session_info        v1.0.1
-----

Click to view modules imported as dependencies

Cython                      3.1.4
PIL                         11.3.0
adjustText                  1.3.0
appnope                     0.1.4
argparse                    1.1
arrow                       1.3.0
attr                        25.3.0
attrs                       25.3.0
babel                       2.17.0
beartype                    0.22.8
certifi                     2025.08.03
cffi                        2.0.0
charset_normalizer          3.4.3
cloudpickle                 3.1.1
comm                        0.2.3
coverage                    7.11.0
csv                         1.0
ctypes                      1.1.0
cycler                      0.12.1
cython                      3.1.4
dask                        2024.11.2
datashader                  0.18.2
dateutil                    2.9.0.post0
debugpy                     1.8.17
decimal                     1.70
decorator                   5.2.1
defusedxml                  0.7.1
deprecated                  1.2.18
docrep                      0.3.2
et_xmlfile                  2.0.0
executing                   2.2.1
fsspec                      2025.9.0
geopandas                   1.1.1
h5py                        3.14.0
idna                        3.10
igraph                      0.11.9
imageio                     2.37.0
ipaddress                   1.0
ipykernel                   6.30.1
ipywidgets                  8.1.7
jedi                        0.19.2
jinja2                      3.1.6
joblib                      1.5.2
json                        2.0.9
json5                       0.12.1
jsonpointer                 3.0.0
jsonschema                  4.25.1
jupyter_events              0.12.0
jupyter_server              2.17.0
jupyterlab_server           2.27.3
kiwisolver                  1.4.9
lark                        1.2.2
lazy_loader                 0.4
leidenalg                   0.10.2
llvmlite                    0.45.0
logging                     0.5.1.2
markupsafe                  3.0.2
marshal                     4
matplotlib                  3.10.6
matplotlib_inline           0.1.7
matplotlib_scalebar         0.9.0
more_itertools              10.8.0
msgpack                     1.1.2
multipledispatch            0.6.0
multiscale_spatial_image    2.0.3
natsort                     8.4.0
nbformat                    5.10.4
networkx                    3.5
numba                       0.62.0
numcodecs                   0.15.1
numpy                       2.3.3
openpyxl                    3.1.5
packaging                   25.0
parso                       0.8.5
patsy                       1.0.1
platform                    1.0.8
platformdirs                4.4.0
polars                      1.33.1
prompt_toolkit              3.0.52
psutil                      7.1.0
pure_eval                   0.2.3
pyarrow                     21.0.0
pycparser                   2.23
pyct                        0.6.0
pydot                       4.0.1
pygments                    2.19.2
pynndescent                 0.5.13
pyparsing                   3.2.4
pyproj                      3.7.2
pytz                        2025.2
re                          2.2.1
requests                    2.32.5
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
scanpy                      1.11.4
scipy                       1.15.3
seaborn                     0.13.2
shapely                     2.1.2
simplejson                  3.20.2
six                         1.17.0
skimage                     0.25.2
sklearn                     1.7.2
sniffio                     1.3.1
socketserver                0.4
sparse                      0.17.0
spatial_image               1.2.3
spatialdata                 0.5.0
spatialleiden               0.4.0
sqlite3                     2.6.0
squidpy                     1.6.5
stack_data                  0.6.3
statsmodels                 0.14.5
stdlib_list                 0.11.1
sys                         3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
tarfile                     0.9.0
texttable                   1.7.0
threadpoolctl               3.6.0
tifffile                    2025.9.20
tlz                         1.0.0
toolz                       1.0.0
torch                       2.8.0
tornado                     6.5.2
tqdm                        4.67.1
traitlets                   5.14.3
umap                        0.5.9.post2
urllib3                     2.5.0
validators                  0.35.0
wcwidth                     0.2.13
websocket                   1.8.0
wrapt                       1.17.3
xarray                      2025.9.0
xarray_dataclass            3.0.0
xarray_schema               0.0.3
xrspatial                   0.4.0
yaml                        6.0.2
zarr                        2.18.7
zlib                        1.0
zmq                         27.1.0

-----
IPython             9.5.0
jupyter_client      8.6.3
jupyter_core        5.8.1
jupyterlab          4.4.7
notebook            7.4.5
-----
Python 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
macOS-26.3-arm64-arm-64bit
10 logical CPU cores, arm
-----
Session information updated at 2026-03-06 10:44