Differential gene expression analysis

Differential gene expression analysis#

To perform differential gene expression analysis we have several alternatives and modes:

Single-cell level
Pseudobulk level

The dotools_py package includes two functions to automatically test for DEA between two conditions for all the celltypes we have defined in our object, as well a consensus function to run both approaches.

Environment setup#

import anndata as ad
import dotools_py as do
import session_info


adata = ad.read_h5ad("/Users/david/Downloads/Data10x/adata.h5ad")
adata

2025-10-22 16:33:19,274 - Jupyter enviroment detected. Using "inline" backend

AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

DEA at the single-cell level#

Among the test we can use we have: wilcoxon, t-test, logistic regression, t-test with overestimation of the variance and the MAST test. The MAST test can be run directly using the do.tl.run_mast, the other tests can be run with do.tl.rank_genes_groups. Alternatively, we can use do.tl.rank_genes_condition, to use any of the test and automatically test for all the cell-types

To reduce the computation time, we are going to only use the NK cells

nk = adata[adata.obs.annotation == "NK"].copy()

df = do.tl.rank_genes_condition(
    nk,
    groupby="condition",
    subset_by="annotation",
    reference="healthy",
    groups=["disease"],
    method="mast",
    get_results=True,
)

2025-10-22 16:34:01,821 - Running DGEs for NK.
2025-10-22 16:34:01,824 - Running MAST test
2025-10-22 16:34:12,968 - Reference level is set to healthy

df.head(10)

	GeneName	log2fc	pvals	padj	pts_group	pts_ref	group	annotation
0	A1BG	2.604576	0.001125	0.021286	0.200000	0.042553	disease	NK
1	A1BG-AS1	-22.228531	0.781334	1.000000	0.000000	0.002660	disease	NK
2	A2M	-24.127653	0.639331	1.000000	0.000000	0.005319	disease	NK
3	A2M-AS1	-0.216740	0.636112	1.000000	0.133333	0.170213	disease	NK
4	A4GALT	0.000000	1.000000	1.000000	0.000000	0.000000	disease	NK
5	AAAS	-1.100035	0.694980	1.000000	0.022222	0.047872	disease	NK
6	AACS	-25.872604	0.328578	0.901818	0.000000	0.015957	disease	NK
7	AAED1	0.645098	0.001607	0.028256	0.044444	0.090426	disease	NK
8	AAGAB	0.570027	0.643581	1.000000	0.088889	0.066489	disease	NK
9	AAK1	-0.717432	0.129302	0.568715	0.333333	0.465426	disease	NK

DEA at the pseudobulk level#

To perform differential gene expression using a pseudobulk approach we can use do.tl.rank_genes_pseudobulk, which test between two conditions for each cell-type. We can use DESEq2 or edgeR. In this case we need to generate pseudo-replicates since we only have one sample per condition.

df = do.tl.rank_genes_pseudobulk(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    design="~condition",
    min_cells=30,
    min_counts=10,
    method="deseq2",
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
)

2025-10-22 16:36:05,602 - Generating Pseudo-bulk data
2025-10-22 16:36:23,693 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:23,696 - Run DESeq2
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
          baseMean  log2FoldChange     lfcSE      stat    pvalue  padj
A1BG      5.669955        0.403825  1.067884  0.378154  0.705316   NaN
A1BG-AS1  0.738919        2.278757  3.406331  0.668977  0.503510   NaN
A2M-AS1   0.157668        0.021039  5.332329  0.003946  0.996852   NaN
A4GALT    2.649744        4.131701  2.716025  1.521231  0.128202   NaN
AAAS      1.998178       -0.768574  1.708237 -0.449923  0.652766   NaN
...            ...             ...       ...       ...       ...   ...
ZXDB      1.219480       -4.522738  3.132528 -1.443798  0.148796   NaN
ZXDC      3.052221       -0.368758  1.482192 -0.248793  0.803521   NaN
ZYG11B    2.702957        1.185403  1.593806  0.743756  0.457024   NaN
ZYX       4.646288        0.371801  1.177736  0.315692  0.752237   NaN
ZZEF1     2.725794       -0.534042  1.487064 -0.359125  0.719501   NaN

[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
2025-10-22 16:36:27,369 - Test could not be computed for Monocytes due to You specified a non-existant category for condition. Possible categories: healthy
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG       9.584556        2.614643  1.247043  2.096674  0.036022  0.102957
A1BG-AS1   0.083831        1.607201  5.343868  0.300756  0.763600       NaN
A2M-AS1   10.868651       -0.295972  0.595112 -0.497339  0.618950  0.740909
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS       2.234670       -1.123438  1.523501 -0.737405  0.460876       NaN
...             ...             ...       ...       ...       ...       ...
ZXDB       0.337953       -0.396078  2.704777 -0.146436  0.883577       NaN
ZXDC       1.524730       -2.568738  2.527960 -1.016131  0.309567       NaN
ZYG11B     0.252808        0.021401  2.999455  0.007135  0.994307       NaN
ZYX       11.933621       -1.723495  0.810586 -2.126234  0.033484  0.098208
ZZEF1      3.545880       -3.787932  2.403851 -1.575777  0.115077       NaN

[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG      41.778608       -0.274318  0.380225 -0.721461  0.470626  0.670616
A1BG-AS1   4.012429       -0.926947  1.281293 -0.723446  0.469406  0.669769
A2M-AS1    4.688797       -1.240566  1.236603 -1.003204  0.315762  0.530839
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS      15.380437       -1.665547  0.709661 -2.346961  0.018927  0.073339
...             ...             ...       ...       ...       ...       ...
ZXDB       5.162987        1.357682  1.049996  1.293036  0.195999  0.389684
ZXDC      18.622773       -0.592434  0.604882 -0.979422  0.327372  0.542457
ZYG11B    18.725857       -0.319720  0.563125 -0.567761  0.570197  0.748763
ZYX       59.401751       -0.652206  0.324146 -2.012075  0.044212  0.138577
ZZEF1     23.713091       -0.187387  0.493061 -0.380049  0.703909  0.841674

[12711 rows x 6 columns]

df.head(10)

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	group
A1BG	5.669955	0.403825	1.067884	0.378154	0.705316	1.000000	B_cells
A1BG-AS1	0.738919	2.278757	3.406331	0.668977	0.503510	1.000000	B_cells
A2M-AS1	0.157668	0.021039	5.332329	0.003946	0.996852	1.000000	B_cells
A4GALT	2.649744	4.131701	2.716025	1.521231	0.128202	1.000000	B_cells
AAAS	1.998178	-0.768574	1.708237	-0.449923	0.652766	1.000000	B_cells
AACS	1.038974	0.518157	2.502786	0.207032	0.835985	1.000000	B_cells
AAED1	9.829811	1.880914	0.951986	1.975780	0.048180	0.163028	B_cells
AAGAB	2.013459	0.515767	1.758700	0.293266	0.769319	1.000000	B_cells
AAK1	0.816817	0.059553	2.702106	0.022039	0.982417	1.000000	B_cells
AAMDC	1.314319	-0.937528	2.182265	-0.429613	0.667478	1.000000	B_cells

DEA consensus#

Additionally, the do.tl.rank_genes_consensus allow to perform both single-cell and pseudo-bulk DEA and generate a dataframe that summarises everything.

df = do.tl.rank_genes_consensus(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    min_cells=30,
    min_counts=10,
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
    test_pseudobulk="edger",
    test="wilcoxon",
)

2025-10-22 16:36:53,084 - Running wilcoxon
2025-10-22 16:36:53,151 - Running DGEs for B_cells.
2025-10-22 16:36:53,153 - Running wilcoxon test.
2025-10-22 16:36:53,480 - Running DGEs for Monocytes.
2025-10-22 16:36:53,482 - Running wilcoxon test.
2025-10-22 16:36:53,565 - Running DGEs for NK.
2025-10-22 16:36:53,566 - Running wilcoxon test.
2025-10-22 16:36:53,677 - Running DGEs for T_cells.
2025-10-22 16:36:53,680 - Running wilcoxon test.
2025-10-22 16:36:54,161 - Running DGEs for pDC.
2025-10-22 16:36:54,163 - Running wilcoxon test.
2025-10-22 16:36:54,197 - Running edger
2025-10-22 16:36:54,197 - Generating Pseudo-bulk data
2025-10-22 16:36:57,684 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:57,699 - Run edgeR
2025-10-22 16:36:57,703 - Running DEA for B_cells
2025-10-22 16:37:08,520 - Running DEA for Monocytes
2025-10-22 16:37:18,426 - Test could not be computed for Monocytes due to [Errno 2] No such file or directory: '/tmp/EdgeR_Test_eb8e58c43d00401db58be2acaae8a194/dge_Monocytes_edgeR.csv'
2025-10-22 16:37:18,427 - Running DEA for NK
2025-10-22 16:37:28,661 - Running DEA for T_cells
2025-10-22 16:37:38,776 - Generating consensus DataFrame

df.head(10)

	GeneName	wilcox_score	log2fc	pvals	padj	pts_group	pts_ref	group	annotation	log2fc_edger	stat_edger	pval_edger	padj_edger	sc_signicant	psc_signicant	consensus_significant	MeanExpr_disease	MeanExpr_healthy
0	JUND	10.587980	2.167520	3.388231e-26	1.254797e-22	0.982558	0.941176	disease	B_cells	2.150242	266.374184	1.508208e-09	5.454230e-07	Yes	Yes	Yes	4.278455	2.823328
1	CD83	10.266931	4.102563	9.931388e-25	3.064992e-21	0.895349	0.176471	disease	B_cells	3.985247	289.033574	9.565300e-10	3.818415e-07	Yes	Yes	Yes	2.662119	0.574226
2	RPS4Y1	10.197039	32.221443	2.044093e-24	5.407211e-21	0.813953	0.000000	disease	B_cells	9.754567	211.691989	1.603596e-08	3.189553e-06	Yes	Yes	Yes	1.793008	0.000000
3	FOS	10.120420	4.805681	4.484900e-24	1.038086e-20	0.912791	0.308824	disease	B_cells	4.670979	681.658744	6.302140e-12	1.253496e-08	Yes	Yes	Yes	3.868105	0.984030
4	HSP90AA1	9.908383	2.990328	3.827898e-23	7.875687e-20	0.959302	0.632353	disease	B_cells	3.064642	316.093902	5.622099e-10	3.621600e-07	Yes	Yes	Yes	3.364896	1.507360
5	CREM	9.665054	5.431766	4.243946e-22	7.144105e-19	0.802326	0.058824	disease	B_cells	5.363856	299.596476	7.086365e-10	3.621600e-07	Yes	Yes	Yes	2.367836	0.202228
6	YPEL5	9.301365	3.094231	1.386545e-20	1.974973e-17	0.889535	0.323529	disease	B_cells	2.976510	147.040320	4.389385e-08	6.073501e-06	Yes	Yes	Yes	2.494147	0.833389
7	SRGN	9.120118	4.315721	7.504225e-20	8.684733e-17	0.802326	0.161765	disease	B_cells	4.067877	206.063003	6.646860e-09	1.652575e-06	Yes	Yes	Yes	2.626802	0.497278
8	TUBB4B	8.940316	3.670844	3.880572e-19	3.992031e-16	0.784884	0.147059	disease	B_cells	3.703473	165.803423	2.310259e-08	4.177369e-06	Yes	Yes	Yes	1.940737	0.384084
9	RGS1	8.825177	7.311319	1.092864e-18	9.233199e-16	0.697674	0.029412	disease	B_cells	7.306141	567.203038	4.569613e-12	1.253496e-08	Yes	Yes	Yes	2.666053	0.080899

adata

AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap', 'rank_genes_pseudobulk', 'rank_genes_condition', 'rank_genes_consensus'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

As we can appreciate, the results of the DEA will be saved in the uns attributed

session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)

Click to view session information

-----
anndata             0.11.4
dotools_py          0.0.1
pandas              2.3.2
session_info        v1.0.1
sys                 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
-----

Click to view modules imported as dependencies

Cython                      3.1.4
PIL                         11.3.0
absl                        2.3.1
adjustText                  1.3.0
appnope                     0.1.4
argparse                    1.1
arrow                       1.3.0
attr                        25.3.0
attrs                       25.3.0
babel                       2.17.0
certifi                     2025.08.03
cffi                        2.0.0
charset_normalizer          3.4.3
cloudpickle                 3.1.1
comm                        0.2.3
coverage                    7.11.0
csv                         1.0
ctypes                      1.1.0
cycler                      0.12.1
cython                      3.1.4
dask                        2024.11.2
dateutil                    2.9.0.post0
debugpy                     1.8.17
decimal                     1.70
decorator                   5.2.1
defusedxml                  0.7.1
deprecated                  1.2.18
docrep                      0.3.2
executing                   2.2.1
formulaic                   1.2.0
formulaic_contrasts         1.0.0
fsspec                      2025.9.0
geopandas                   1.1.1
h5py                        3.14.0
idna                        3.10
igraph                      0.11.9
interface_meta              1.3.0
ipaddress                   1.0
ipykernel                   6.30.1
ipywidgets                  8.1.7
jedi                        0.19.2
jinja2                      3.1.6
joblib                      1.5.2
json                        2.0.9
json5                       0.12.1
jsonpointer                 3.0.0
jsonschema                  4.25.1
jupyter_events              0.12.0
jupyter_server              2.17.0
jupyterlab_server           2.27.3
kiwisolver                  1.4.9
lark                        1.2.2
leidenalg                   0.10.2
lightning                   2.5.5
lightning_utilities         0.15.2
llvmlite                    0.45.0
logging                     0.5.1.2
markupsafe                  3.0.2
marshal                     4
matplotlib                  3.10.6
matplotlib_inline           0.1.7
ml_collections              1.1.0
mpmath                      1.3.0
mudata                      0.3.2
narwhals                    2.5.0
natsort                     8.4.0
nbformat                    5.10.4
numba                       0.62.0
numcodecs                   0.15.1
numpy                       2.3.3
opt_einsum                  3.4.0
packaging                   25.0
parso                       0.8.5
patsy                       1.0.1
platform                    1.0.8
platformdirs                4.4.0
polars                      1.33.1
prompt_toolkit              3.0.52
psutil                      7.1.0
pure_eval                   0.2.3
pyarrow                     21.0.0
pycparser                   2.23
pydeseq2                    0.5.2
pygments                    2.19.2
pyparsing                   3.2.4
pyproj                      3.7.2
pyro                        1.9.1
pytz                        2025.2
re                          2.2.1
requests                    2.32.5
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
scanpy                      1.11.3
scipy                       1.15.3
scvi                        1.4.0
seaborn                     0.13.2
shapely                     2.1.2
six                         1.17.0
sklearn                     1.7.2
sniffio                     1.3.1
socketserver                0.4
sparse                      0.17.0
sqlite3                     2.6.0
stack_data                  0.6.3
statsmodels                 0.14.5
stdlib_list                 0.11.1
sympy                       1.14.0
tarfile                     0.9.0
texttable                   1.7.0
threadpoolctl               3.6.0
tlz                         1.0.0
toolz                       1.0.0
torch                       2.8.0
torchmetrics                1.8.2
tornado                     6.5.2
tqdm                        4.67.1
traitlets                   5.14.3
urllib3                     2.5.0
wcwidth                     0.2.13
websocket                   1.8.0
wrapt                       1.17.3
xarray                      2025.9.0
yaml                        6.0.2
zarr                        2.18.7
zlib                        1.0
zmq                         27.1.0

-----
IPython             9.5.0
jupyter_client      8.6.3
jupyter_core        5.8.1
jupyterlab          4.4.7
notebook            7.4.5
-----
Python 3.11.13 (main, Jun  5 2025, 08:21:08) [Clang 14.0.6 ]
macOS-26.0.1-arm64-arm-64bit
10 logical CPU cores, arm
-----
Session information updated at 2025-10-22 16:38