Differential gene expression analysis#
To perform differential gene expression analysis we have several alternatives and modes:
Single-cell level
Pseudobulk level
The dotools_py package includes two functions to automatically test for DEA between two conditions
for all the celltypes we have defined in our object, as well a consensus function to run both approaches.
Environment setup#
import anndata as ad
import dotools_py as do
import session_info
adata = ad.read_h5ad("/Users/david/Downloads/Data10x/adata.h5ad")
adata
2025-10-22 16:33:19,274 - Jupyter enviroment detected. Using "inline" backend
AnnData object with n_obs × n_vars = 2783 × 18517
obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap'
obsm: 'X_CCA', 'X_pca', 'X_umap'
layers: 'counts', 'logcounts'
obsp: 'connectivities', 'distances'
DEA at the single-cell level#
Among the test we can use we have: wilcoxon, t-test, logistic regression, t-test with overestimation of the variance and the MAST test.
The MAST test can be run directly using the do.tl.run_mast, the other tests can be run with do.tl.rank_genes_groups. Alternatively, we can use do.tl.rank_genes_condition, to use any of the test and automatically test for all the cell-types
To reduce the computation time, we are going to only use the NK cells
nk = adata[adata.obs.annotation == "NK"].copy()
df = do.tl.rank_genes_condition(
nk,
groupby="condition",
subset_by="annotation",
reference="healthy",
groups=["disease"],
method="mast",
get_results=True,
)
2025-10-22 16:34:01,821 - Running DGEs for NK.
2025-10-22 16:34:01,824 - Running MAST test
2025-10-22 16:34:12,968 - Reference level is set to healthy
df.head(10)
| GeneName | log2fc | pvals | padj | pts_group | pts_ref | group | annotation | |
|---|---|---|---|---|---|---|---|---|
| 0 | A1BG | 2.604576 | 0.001125 | 0.021286 | 0.200000 | 0.042553 | disease | NK |
| 1 | A1BG-AS1 | -22.228531 | 0.781334 | 1.000000 | 0.000000 | 0.002660 | disease | NK |
| 2 | A2M | -24.127653 | 0.639331 | 1.000000 | 0.000000 | 0.005319 | disease | NK |
| 3 | A2M-AS1 | -0.216740 | 0.636112 | 1.000000 | 0.133333 | 0.170213 | disease | NK |
| 4 | A4GALT | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | disease | NK |
| 5 | AAAS | -1.100035 | 0.694980 | 1.000000 | 0.022222 | 0.047872 | disease | NK |
| 6 | AACS | -25.872604 | 0.328578 | 0.901818 | 0.000000 | 0.015957 | disease | NK |
| 7 | AAED1 | 0.645098 | 0.001607 | 0.028256 | 0.044444 | 0.090426 | disease | NK |
| 8 | AAGAB | 0.570027 | 0.643581 | 1.000000 | 0.088889 | 0.066489 | disease | NK |
| 9 | AAK1 | -0.717432 | 0.129302 | 0.568715 | 0.333333 | 0.465426 | disease | NK |
DEA at the pseudobulk level#
To perform differential gene expression using a pseudobulk approach we can use do.tl.rank_genes_pseudobulk, which test between two conditions for each cell-type. We can use DESEq2 or edgeR. In this case we need to generate pseudo-replicates since we only have one sample per condition.
df = do.tl.rank_genes_pseudobulk(
adata,
ctrl_cond="healthy",
disease_cond="disease",
cluster_key="annotation",
batch_key="batch",
condition_key="condition",
design="~condition",
min_cells=30,
min_counts=10,
method="deseq2",
pseudobulk_approach="sum",
technical_replicates=2,
get_results=True,
)
2025-10-22 16:36:05,602 - Generating Pseudo-bulk data
2025-10-22 16:36:23,693 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:23,696 - Run DESeq2
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
baseMean log2FoldChange lfcSE stat pvalue padj
A1BG 5.669955 0.403825 1.067884 0.378154 0.705316 NaN
A1BG-AS1 0.738919 2.278757 3.406331 0.668977 0.503510 NaN
A2M-AS1 0.157668 0.021039 5.332329 0.003946 0.996852 NaN
A4GALT 2.649744 4.131701 2.716025 1.521231 0.128202 NaN
AAAS 1.998178 -0.768574 1.708237 -0.449923 0.652766 NaN
... ... ... ... ... ... ...
ZXDB 1.219480 -4.522738 3.132528 -1.443798 0.148796 NaN
ZXDC 3.052221 -0.368758 1.482192 -0.248793 0.803521 NaN
ZYG11B 2.702957 1.185403 1.593806 0.743756 0.457024 NaN
ZYX 4.646288 0.371801 1.177736 0.315692 0.752237 NaN
ZZEF1 2.725794 -0.534042 1.487064 -0.359125 0.719501 NaN
[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
2025-10-22 16:36:27,369 - Test could not be computed for Monocytes due to You specified a non-existant category for condition. Possible categories: healthy
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
baseMean log2FoldChange lfcSE stat pvalue padj
A1BG 9.584556 2.614643 1.247043 2.096674 0.036022 0.102957
A1BG-AS1 0.083831 1.607201 5.343868 0.300756 0.763600 NaN
A2M-AS1 10.868651 -0.295972 0.595112 -0.497339 0.618950 0.740909
A4GALT 0.000000 NaN NaN NaN NaN NaN
AAAS 2.234670 -1.123438 1.523501 -0.737405 0.460876 NaN
... ... ... ... ... ... ...
ZXDB 0.337953 -0.396078 2.704777 -0.146436 0.883577 NaN
ZXDC 1.524730 -2.568738 2.527960 -1.016131 0.309567 NaN
ZYG11B 0.252808 0.021401 2.999455 0.007135 0.994307 NaN
ZYX 11.933621 -1.723495 0.810586 -2.126234 0.033484 0.098208
ZZEF1 3.545880 -3.787932 2.403851 -1.575777 0.115077 NaN
[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
baseMean log2FoldChange lfcSE stat pvalue padj
A1BG 41.778608 -0.274318 0.380225 -0.721461 0.470626 0.670616
A1BG-AS1 4.012429 -0.926947 1.281293 -0.723446 0.469406 0.669769
A2M-AS1 4.688797 -1.240566 1.236603 -1.003204 0.315762 0.530839
A4GALT 0.000000 NaN NaN NaN NaN NaN
AAAS 15.380437 -1.665547 0.709661 -2.346961 0.018927 0.073339
... ... ... ... ... ... ...
ZXDB 5.162987 1.357682 1.049996 1.293036 0.195999 0.389684
ZXDC 18.622773 -0.592434 0.604882 -0.979422 0.327372 0.542457
ZYG11B 18.725857 -0.319720 0.563125 -0.567761 0.570197 0.748763
ZYX 59.401751 -0.652206 0.324146 -2.012075 0.044212 0.138577
ZZEF1 23.713091 -0.187387 0.493061 -0.380049 0.703909 0.841674
[12711 rows x 6 columns]
df.head(10)
| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | group | |
|---|---|---|---|---|---|---|---|
| A1BG | 5.669955 | 0.403825 | 1.067884 | 0.378154 | 0.705316 | 1.000000 | B_cells |
| A1BG-AS1 | 0.738919 | 2.278757 | 3.406331 | 0.668977 | 0.503510 | 1.000000 | B_cells |
| A2M-AS1 | 0.157668 | 0.021039 | 5.332329 | 0.003946 | 0.996852 | 1.000000 | B_cells |
| A4GALT | 2.649744 | 4.131701 | 2.716025 | 1.521231 | 0.128202 | 1.000000 | B_cells |
| AAAS | 1.998178 | -0.768574 | 1.708237 | -0.449923 | 0.652766 | 1.000000 | B_cells |
| AACS | 1.038974 | 0.518157 | 2.502786 | 0.207032 | 0.835985 | 1.000000 | B_cells |
| AAED1 | 9.829811 | 1.880914 | 0.951986 | 1.975780 | 0.048180 | 0.163028 | B_cells |
| AAGAB | 2.013459 | 0.515767 | 1.758700 | 0.293266 | 0.769319 | 1.000000 | B_cells |
| AAK1 | 0.816817 | 0.059553 | 2.702106 | 0.022039 | 0.982417 | 1.000000 | B_cells |
| AAMDC | 1.314319 | -0.937528 | 2.182265 | -0.429613 | 0.667478 | 1.000000 | B_cells |
DEA consensus#
Additionally, the do.tl.rank_genes_consensus allow to perform both single-cell and pseudo-bulk DEA and generate a dataframe that summarises everything.
df = do.tl.rank_genes_consensus(
adata,
ctrl_cond="healthy",
disease_cond="disease",
cluster_key="annotation",
batch_key="batch",
condition_key="condition",
min_cells=30,
min_counts=10,
pseudobulk_approach="sum",
technical_replicates=2,
get_results=True,
test_pseudobulk="edger",
test="wilcoxon",
)
2025-10-22 16:36:53,084 - Running wilcoxon
2025-10-22 16:36:53,151 - Running DGEs for B_cells.
2025-10-22 16:36:53,153 - Running wilcoxon test.
2025-10-22 16:36:53,480 - Running DGEs for Monocytes.
2025-10-22 16:36:53,482 - Running wilcoxon test.
2025-10-22 16:36:53,565 - Running DGEs for NK.
2025-10-22 16:36:53,566 - Running wilcoxon test.
2025-10-22 16:36:53,677 - Running DGEs for T_cells.
2025-10-22 16:36:53,680 - Running wilcoxon test.
2025-10-22 16:36:54,161 - Running DGEs for pDC.
2025-10-22 16:36:54,163 - Running wilcoxon test.
2025-10-22 16:36:54,197 - Running edger
2025-10-22 16:36:54,197 - Generating Pseudo-bulk data
2025-10-22 16:36:57,684 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:57,699 - Run edgeR
2025-10-22 16:36:57,703 - Running DEA for B_cells
2025-10-22 16:37:08,520 - Running DEA for Monocytes
2025-10-22 16:37:18,426 - Test could not be computed for Monocytes due to [Errno 2] No such file or directory: '/tmp/EdgeR_Test_eb8e58c43d00401db58be2acaae8a194/dge_Monocytes_edgeR.csv'
2025-10-22 16:37:18,427 - Running DEA for NK
2025-10-22 16:37:28,661 - Running DEA for T_cells
2025-10-22 16:37:38,776 - Generating consensus DataFrame
df.head(10)
| GeneName | wilcox_score | log2fc | pvals | padj | pts_group | pts_ref | group | annotation | log2fc_edger | stat_edger | pval_edger | padj_edger | sc_signicant | psc_signicant | consensus_significant | MeanExpr_disease | MeanExpr_healthy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JUND | 10.587980 | 2.167520 | 3.388231e-26 | 1.254797e-22 | 0.982558 | 0.941176 | disease | B_cells | 2.150242 | 266.374184 | 1.508208e-09 | 5.454230e-07 | Yes | Yes | Yes | 4.278455 | 2.823328 |
| 1 | CD83 | 10.266931 | 4.102563 | 9.931388e-25 | 3.064992e-21 | 0.895349 | 0.176471 | disease | B_cells | 3.985247 | 289.033574 | 9.565300e-10 | 3.818415e-07 | Yes | Yes | Yes | 2.662119 | 0.574226 |
| 2 | RPS4Y1 | 10.197039 | 32.221443 | 2.044093e-24 | 5.407211e-21 | 0.813953 | 0.000000 | disease | B_cells | 9.754567 | 211.691989 | 1.603596e-08 | 3.189553e-06 | Yes | Yes | Yes | 1.793008 | 0.000000 |
| 3 | FOS | 10.120420 | 4.805681 | 4.484900e-24 | 1.038086e-20 | 0.912791 | 0.308824 | disease | B_cells | 4.670979 | 681.658744 | 6.302140e-12 | 1.253496e-08 | Yes | Yes | Yes | 3.868105 | 0.984030 |
| 4 | HSP90AA1 | 9.908383 | 2.990328 | 3.827898e-23 | 7.875687e-20 | 0.959302 | 0.632353 | disease | B_cells | 3.064642 | 316.093902 | 5.622099e-10 | 3.621600e-07 | Yes | Yes | Yes | 3.364896 | 1.507360 |
| 5 | CREM | 9.665054 | 5.431766 | 4.243946e-22 | 7.144105e-19 | 0.802326 | 0.058824 | disease | B_cells | 5.363856 | 299.596476 | 7.086365e-10 | 3.621600e-07 | Yes | Yes | Yes | 2.367836 | 0.202228 |
| 6 | YPEL5 | 9.301365 | 3.094231 | 1.386545e-20 | 1.974973e-17 | 0.889535 | 0.323529 | disease | B_cells | 2.976510 | 147.040320 | 4.389385e-08 | 6.073501e-06 | Yes | Yes | Yes | 2.494147 | 0.833389 |
| 7 | SRGN | 9.120118 | 4.315721 | 7.504225e-20 | 8.684733e-17 | 0.802326 | 0.161765 | disease | B_cells | 4.067877 | 206.063003 | 6.646860e-09 | 1.652575e-06 | Yes | Yes | Yes | 2.626802 | 0.497278 |
| 8 | TUBB4B | 8.940316 | 3.670844 | 3.880572e-19 | 3.992031e-16 | 0.784884 | 0.147059 | disease | B_cells | 3.703473 | 165.803423 | 2.310259e-08 | 4.177369e-06 | Yes | Yes | Yes | 1.940737 | 0.384084 |
| 9 | RGS1 | 8.825177 | 7.311319 | 1.092864e-18 | 9.233199e-16 | 0.697674 | 0.029412 | disease | B_cells | 7.306141 | 567.203038 | 4.569613e-12 | 1.253496e-08 | Yes | Yes | Yes | 2.666053 | 0.080899 |
adata
AnnData object with n_obs × n_vars = 2783 × 18517
obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap', 'rank_genes_pseudobulk', 'rank_genes_condition', 'rank_genes_consensus'
obsm: 'X_CCA', 'X_pca', 'X_umap'
layers: 'counts', 'logcounts'
obsp: 'connectivities', 'distances'
As we can appreciate, the results of the DEA will be saved in the uns attributed
session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)
Click to view session information
----- anndata 0.11.4 dotools_py 0.0.1 pandas 2.3.2 session_info v1.0.1 sys 3.11.13 (main, Jun 5 2025, 08:21:08) [Clang 14.0.6 ] -----
Click to view modules imported as dependencies
Cython 3.1.4 PIL 11.3.0 absl 2.3.1 adjustText 1.3.0 appnope 0.1.4 argparse 1.1 arrow 1.3.0 attr 25.3.0 attrs 25.3.0 babel 2.17.0 certifi 2025.08.03 cffi 2.0.0 charset_normalizer 3.4.3 cloudpickle 3.1.1 comm 0.2.3 coverage 7.11.0 csv 1.0 ctypes 1.1.0 cycler 0.12.1 cython 3.1.4 dask 2024.11.2 dateutil 2.9.0.post0 debugpy 1.8.17 decimal 1.70 decorator 5.2.1 defusedxml 0.7.1 deprecated 1.2.18 docrep 0.3.2 executing 2.2.1 formulaic 1.2.0 formulaic_contrasts 1.0.0 fsspec 2025.9.0 geopandas 1.1.1 h5py 3.14.0 idna 3.10 igraph 0.11.9 interface_meta 1.3.0 ipaddress 1.0 ipykernel 6.30.1 ipywidgets 8.1.7 jedi 0.19.2 jinja2 3.1.6 joblib 1.5.2 json 2.0.9 json5 0.12.1 jsonpointer 3.0.0 jsonschema 4.25.1 jupyter_events 0.12.0 jupyter_server 2.17.0 jupyterlab_server 2.27.3 kiwisolver 1.4.9 lark 1.2.2 leidenalg 0.10.2 lightning 2.5.5 lightning_utilities 0.15.2 llvmlite 0.45.0 logging 0.5.1.2 markupsafe 3.0.2 marshal 4 matplotlib 3.10.6 matplotlib_inline 0.1.7 ml_collections 1.1.0 mpmath 1.3.0 mudata 0.3.2 narwhals 2.5.0 natsort 8.4.0 nbformat 5.10.4 numba 0.62.0 numcodecs 0.15.1 numpy 2.3.3 opt_einsum 3.4.0 packaging 25.0 parso 0.8.5 patsy 1.0.1 platform 1.0.8 platformdirs 4.4.0 polars 1.33.1 prompt_toolkit 3.0.52 psutil 7.1.0 pure_eval 0.2.3 pyarrow 21.0.0 pycparser 2.23 pydeseq2 0.5.2 pygments 2.19.2 pyparsing 3.2.4 pyproj 3.7.2 pyro 1.9.1 pytz 2025.2 re 2.2.1 requests 2.32.5 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 scanpy 1.11.3 scipy 1.15.3 scvi 1.4.0 seaborn 0.13.2 shapely 2.1.2 six 1.17.0 sklearn 1.7.2 sniffio 1.3.1 socketserver 0.4 sparse 0.17.0 sqlite3 2.6.0 stack_data 0.6.3 statsmodels 0.14.5 stdlib_list 0.11.1 sympy 1.14.0 tarfile 0.9.0 texttable 1.7.0 threadpoolctl 3.6.0 tlz 1.0.0 toolz 1.0.0 torch 2.8.0 torchmetrics 1.8.2 tornado 6.5.2 tqdm 4.67.1 traitlets 5.14.3 urllib3 2.5.0 wcwidth 0.2.13 websocket 1.8.0 wrapt 1.17.3 xarray 2025.9.0 yaml 6.0.2 zarr 2.18.7 zlib 1.0 zmq 27.1.0
----- IPython 9.5.0 jupyter_client 8.6.3 jupyter_core 5.8.1 jupyterlab 4.4.7 notebook 7.4.5 ----- Python 3.11.13 (main, Jun 5 2025, 08:21:08) [Clang 14.0.6 ] macOS-26.0.1-arm64-arm-64bit 10 logical CPU cores, arm ----- Session information updated at 2025-10-22 16:38