Differential gene expression analysis#

To perform differential gene expression analysis we have several alternatives and modes:

  • Single-cell level

  • Pseudobulk level

The dotools_py package includes two functions to automatically test for DEA between two conditions for all the celltypes we have defined in our object, as well a consensus function to run both approaches.

Environment setup#

import anndata as ad
import dotools_py as do
import session_info


adata = ad.read_h5ad("/Users/david/Downloads/Data10x/adata.h5ad")
adata
2025-10-22 16:33:19,274 - Jupyter enviroment detected. Using "inline" backend
AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

DEA at the single-cell level#

Among the test we can use we have: wilcoxon, t-test, logistic regression, t-test with overestimation of the variance and the MAST test. The MAST test can be run directly using the do.tl.run_mast, the other tests can be run with do.tl.rank_genes_groups. Alternatively, we can use do.tl.rank_genes_condition, to use any of the test and automatically test for all the cell-types

To reduce the computation time, we are going to only use the NK cells

nk = adata[adata.obs.annotation == "NK"].copy()

df = do.tl.rank_genes_condition(
    nk,
    groupby="condition",
    subset_by="annotation",
    reference="healthy",
    groups=["disease"],
    method="mast",
    get_results=True,
)
2025-10-22 16:34:01,821 - Running DGEs for NK.
2025-10-22 16:34:01,824 - Running MAST test
2025-10-22 16:34:12,968 - Reference level is set to healthy
df.head(10)
GeneName log2fc pvals padj pts_group pts_ref group annotation
0 A1BG 2.604576 0.001125 0.021286 0.200000 0.042553 disease NK
1 A1BG-AS1 -22.228531 0.781334 1.000000 0.000000 0.002660 disease NK
2 A2M -24.127653 0.639331 1.000000 0.000000 0.005319 disease NK
3 A2M-AS1 -0.216740 0.636112 1.000000 0.133333 0.170213 disease NK
4 A4GALT 0.000000 1.000000 1.000000 0.000000 0.000000 disease NK
5 AAAS -1.100035 0.694980 1.000000 0.022222 0.047872 disease NK
6 AACS -25.872604 0.328578 0.901818 0.000000 0.015957 disease NK
7 AAED1 0.645098 0.001607 0.028256 0.044444 0.090426 disease NK
8 AAGAB 0.570027 0.643581 1.000000 0.088889 0.066489 disease NK
9 AAK1 -0.717432 0.129302 0.568715 0.333333 0.465426 disease NK

DEA at the pseudobulk level#

To perform differential gene expression using a pseudobulk approach we can use do.tl.rank_genes_pseudobulk, which test between two conditions for each cell-type. We can use DESEq2 or edgeR. In this case we need to generate pseudo-replicates since we only have one sample per condition.

df = do.tl.rank_genes_pseudobulk(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    design="~condition",
    min_cells=30,
    min_counts=10,
    method="deseq2",
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
)
2025-10-22 16:36:05,602 - Generating Pseudo-bulk data
2025-10-22 16:36:23,693 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:23,696 - Run DESeq2
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
          baseMean  log2FoldChange     lfcSE      stat    pvalue  padj
A1BG      5.669955        0.403825  1.067884  0.378154  0.705316   NaN
A1BG-AS1  0.738919        2.278757  3.406331  0.668977  0.503510   NaN
A2M-AS1   0.157668        0.021039  5.332329  0.003946  0.996852   NaN
A4GALT    2.649744        4.131701  2.716025  1.521231  0.128202   NaN
AAAS      1.998178       -0.768574  1.708237 -0.449923  0.652766   NaN
...            ...             ...       ...       ...       ...   ...
ZXDB      1.219480       -4.522738  3.132528 -1.443798  0.148796   NaN
ZXDC      3.052221       -0.368758  1.482192 -0.248793  0.803521   NaN
ZYG11B    2.702957        1.185403  1.593806  0.743756  0.457024   NaN
ZYX       4.646288        0.371801  1.177736  0.315692  0.752237   NaN
ZZEF1     2.725794       -0.534042  1.487064 -0.359125  0.719501   NaN

[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
2025-10-22 16:36:27,369 - Test could not be computed for Monocytes due to You specified a non-existant category for condition. Possible categories: healthy
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG       9.584556        2.614643  1.247043  2.096674  0.036022  0.102957
A1BG-AS1   0.083831        1.607201  5.343868  0.300756  0.763600       NaN
A2M-AS1   10.868651       -0.295972  0.595112 -0.497339  0.618950  0.740909
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS       2.234670       -1.123438  1.523501 -0.737405  0.460876       NaN
...             ...             ...       ...       ...       ...       ...
ZXDB       0.337953       -0.396078  2.704777 -0.146436  0.883577       NaN
ZXDC       1.524730       -2.568738  2.527960 -1.016131  0.309567       NaN
ZYG11B     0.252808        0.021401  2.999455  0.007135  0.994307       NaN
ZYX       11.933621       -1.723495  0.810586 -2.126234  0.033484  0.098208
ZZEF1      3.545880       -3.787932  2.403851 -1.575777  0.115077       NaN

[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG      41.778608       -0.274318  0.380225 -0.721461  0.470626  0.670616
A1BG-AS1   4.012429       -0.926947  1.281293 -0.723446  0.469406  0.669769
A2M-AS1    4.688797       -1.240566  1.236603 -1.003204  0.315762  0.530839
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS      15.380437       -1.665547  0.709661 -2.346961  0.018927  0.073339
...             ...             ...       ...       ...       ...       ...
ZXDB       5.162987        1.357682  1.049996  1.293036  0.195999  0.389684
ZXDC      18.622773       -0.592434  0.604882 -0.979422  0.327372  0.542457
ZYG11B    18.725857       -0.319720  0.563125 -0.567761  0.570197  0.748763
ZYX       59.401751       -0.652206  0.324146 -2.012075  0.044212  0.138577
ZZEF1     23.713091       -0.187387  0.493061 -0.380049  0.703909  0.841674

[12711 rows x 6 columns]
df.head(10)
baseMean log2FoldChange lfcSE stat pvalue padj group
A1BG 5.669955 0.403825 1.067884 0.378154 0.705316 1.000000 B_cells
A1BG-AS1 0.738919 2.278757 3.406331 0.668977 0.503510 1.000000 B_cells
A2M-AS1 0.157668 0.021039 5.332329 0.003946 0.996852 1.000000 B_cells
A4GALT 2.649744 4.131701 2.716025 1.521231 0.128202 1.000000 B_cells
AAAS 1.998178 -0.768574 1.708237 -0.449923 0.652766 1.000000 B_cells
AACS 1.038974 0.518157 2.502786 0.207032 0.835985 1.000000 B_cells
AAED1 9.829811 1.880914 0.951986 1.975780 0.048180 0.163028 B_cells
AAGAB 2.013459 0.515767 1.758700 0.293266 0.769319 1.000000 B_cells
AAK1 0.816817 0.059553 2.702106 0.022039 0.982417 1.000000 B_cells
AAMDC 1.314319 -0.937528 2.182265 -0.429613 0.667478 1.000000 B_cells

DEA consensus#

Additionally, the do.tl.rank_genes_consensus allow to perform both single-cell and pseudo-bulk DEA and generate a dataframe that summarises everything.

df = do.tl.rank_genes_consensus(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    min_cells=30,
    min_counts=10,
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
    test_pseudobulk="edger",
    test="wilcoxon",
)
2025-10-22 16:36:53,084 - Running wilcoxon
2025-10-22 16:36:53,151 - Running DGEs for B_cells.
2025-10-22 16:36:53,153 - Running wilcoxon test.
2025-10-22 16:36:53,480 - Running DGEs for Monocytes.
2025-10-22 16:36:53,482 - Running wilcoxon test.
2025-10-22 16:36:53,565 - Running DGEs for NK.
2025-10-22 16:36:53,566 - Running wilcoxon test.
2025-10-22 16:36:53,677 - Running DGEs for T_cells.
2025-10-22 16:36:53,680 - Running wilcoxon test.
2025-10-22 16:36:54,161 - Running DGEs for pDC.
2025-10-22 16:36:54,163 - Running wilcoxon test.
2025-10-22 16:36:54,197 - Running edger
2025-10-22 16:36:54,197 - Generating Pseudo-bulk data
2025-10-22 16:36:57,684 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:57,699 - Run edgeR
2025-10-22 16:36:57,703 - Running DEA for B_cells
2025-10-22 16:37:08,520 - Running DEA for Monocytes
2025-10-22 16:37:18,426 - Test could not be computed for Monocytes due to [Errno 2] No such file or directory: '/tmp/EdgeR_Test_eb8e58c43d00401db58be2acaae8a194/dge_Monocytes_edgeR.csv'
2025-10-22 16:37:18,427 - Running DEA for NK
2025-10-22 16:37:28,661 - Running DEA for T_cells
2025-10-22 16:37:38,776 - Generating consensus DataFrame
df.head(10)
GeneName wilcox_score log2fc pvals padj pts_group pts_ref group annotation log2fc_edger stat_edger pval_edger padj_edger sc_signicant psc_signicant consensus_significant MeanExpr_disease MeanExpr_healthy
0 JUND 10.587980 2.167520 3.388231e-26 1.254797e-22 0.982558 0.941176 disease B_cells 2.150242 266.374184 1.508208e-09 5.454230e-07 Yes Yes Yes 4.278455 2.823328
1 CD83 10.266931 4.102563 9.931388e-25 3.064992e-21 0.895349 0.176471 disease B_cells 3.985247 289.033574 9.565300e-10 3.818415e-07 Yes Yes Yes 2.662119 0.574226
2 RPS4Y1 10.197039 32.221443 2.044093e-24 5.407211e-21 0.813953 0.000000 disease B_cells 9.754567 211.691989 1.603596e-08 3.189553e-06 Yes Yes Yes 1.793008 0.000000
3 FOS 10.120420 4.805681 4.484900e-24 1.038086e-20 0.912791 0.308824 disease B_cells 4.670979 681.658744 6.302140e-12 1.253496e-08 Yes Yes Yes 3.868105 0.984030
4 HSP90AA1 9.908383 2.990328 3.827898e-23 7.875687e-20 0.959302 0.632353 disease B_cells 3.064642 316.093902 5.622099e-10 3.621600e-07 Yes Yes Yes 3.364896 1.507360
5 CREM 9.665054 5.431766 4.243946e-22 7.144105e-19 0.802326 0.058824 disease B_cells 5.363856 299.596476 7.086365e-10 3.621600e-07 Yes Yes Yes 2.367836 0.202228
6 YPEL5 9.301365 3.094231 1.386545e-20 1.974973e-17 0.889535 0.323529 disease B_cells 2.976510 147.040320 4.389385e-08 6.073501e-06 Yes Yes Yes 2.494147 0.833389
7 SRGN 9.120118 4.315721 7.504225e-20 8.684733e-17 0.802326 0.161765 disease B_cells 4.067877 206.063003 6.646860e-09 1.652575e-06 Yes Yes Yes 2.626802 0.497278
8 TUBB4B 8.940316 3.670844 3.880572e-19 3.992031e-16 0.784884 0.147059 disease B_cells 3.703473 165.803423 2.310259e-08 4.177369e-06 Yes Yes Yes 1.940737 0.384084
9 RGS1 8.825177 7.311319 1.092864e-18 9.233199e-16 0.697674 0.029412 disease B_cells 7.306141 567.203038 4.569613e-12 1.253496e-08 Yes Yes Yes 2.666053 0.080899
adata
AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap', 'rank_genes_pseudobulk', 'rank_genes_condition', 'rank_genes_consensus'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

As we can appreciate, the results of the DEA will be saved in the uns attributed

session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)
Click to view session information
-----
anndata             0.11.4
dotools_py          0.0.1
pandas              2.3.2
session_info        v1.0.1
sys                 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
-----
Click to view modules imported as dependencies
Cython                      3.1.4
PIL                         11.3.0
absl                        2.3.1
adjustText                  1.3.0
appnope                     0.1.4
argparse                    1.1
arrow                       1.3.0
attr                        25.3.0
attrs                       25.3.0
babel                       2.17.0
certifi                     2025.08.03
cffi                        2.0.0
charset_normalizer          3.4.3
cloudpickle                 3.1.1
comm                        0.2.3
coverage                    7.11.0
csv                         1.0
ctypes                      1.1.0
cycler                      0.12.1
cython                      3.1.4
dask                        2024.11.2
dateutil                    2.9.0.post0
debugpy                     1.8.17
decimal                     1.70
decorator                   5.2.1
defusedxml                  0.7.1
deprecated                  1.2.18
docrep                      0.3.2
executing                   2.2.1
formulaic                   1.2.0
formulaic_contrasts         1.0.0
fsspec                      2025.9.0
geopandas                   1.1.1
h5py                        3.14.0
idna                        3.10
igraph                      0.11.9
interface_meta              1.3.0
ipaddress                   1.0
ipykernel                   6.30.1
ipywidgets                  8.1.7
jedi                        0.19.2
jinja2                      3.1.6
joblib                      1.5.2
json                        2.0.9
json5                       0.12.1
jsonpointer                 3.0.0
jsonschema                  4.25.1
jupyter_events              0.12.0
jupyter_server              2.17.0
jupyterlab_server           2.27.3
kiwisolver                  1.4.9
lark                        1.2.2
leidenalg                   0.10.2
lightning                   2.5.5
lightning_utilities         0.15.2
llvmlite                    0.45.0
logging                     0.5.1.2
markupsafe                  3.0.2
marshal                     4
matplotlib                  3.10.6
matplotlib_inline           0.1.7
ml_collections              1.1.0
mpmath                      1.3.0
mudata                      0.3.2
narwhals                    2.5.0
natsort                     8.4.0
nbformat                    5.10.4
numba                       0.62.0
numcodecs                   0.15.1
numpy                       2.3.3
opt_einsum                  3.4.0
packaging                   25.0
parso                       0.8.5
patsy                       1.0.1
platform                    1.0.8
platformdirs                4.4.0
polars                      1.33.1
prompt_toolkit              3.0.52
psutil                      7.1.0
pure_eval                   0.2.3
pyarrow                     21.0.0
pycparser                   2.23
pydeseq2                    0.5.2
pygments                    2.19.2
pyparsing                   3.2.4
pyproj                      3.7.2
pyro                        1.9.1
pytz                        2025.2
re                          2.2.1
requests                    2.32.5
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
scanpy                      1.11.3
scipy                       1.15.3
scvi                        1.4.0
seaborn                     0.13.2
shapely                     2.1.2
six                         1.17.0
sklearn                     1.7.2
sniffio                     1.3.1
socketserver                0.4
sparse                      0.17.0
sqlite3                     2.6.0
stack_data                  0.6.3
statsmodels                 0.14.5
stdlib_list                 0.11.1
sympy                       1.14.0
tarfile                     0.9.0
texttable                   1.7.0
threadpoolctl               3.6.0
tlz                         1.0.0
toolz                       1.0.0
torch                       2.8.0
torchmetrics                1.8.2
tornado                     6.5.2
tqdm                        4.67.1
traitlets                   5.14.3
urllib3                     2.5.0
wcwidth                     0.2.13
websocket                   1.8.0
wrapt                       1.17.3
xarray                      2025.9.0
yaml                        6.0.2
zarr                        2.18.7
zlib                        1.0
zmq                         27.1.0
-----
IPython             9.5.0
jupyter_client      8.6.3
jupyter_core        5.8.1
jupyterlab          4.4.7
notebook            7.4.5
-----
Python 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
macOS-26.0.1-arm64-arm-64bit
10 logical CPU cores, arm
-----
Session information updated at 2025-10-22 16:38