dotools_py.tl.reclustering

dotools_py.tl.reclustering(adata, cluster_key, batch_key, recluster_approach, use_clusters=None, bbknn=False, hvg_batch=False, use_rep=None, resolution=0.3, neighbors_batch=3, automatic_annot=False, majority=True, convert=True, model='Healthy_Adult_Heart.pkl', get_subset=False, key_added='annotation_recluster', key_added_autoannot='autoAnnot_recluster', random_state=0)

Re-cluster a dataset.

Performs re-clustering on an integrated AnnData object. The following integration methods are supported:
  • CCA (v4/v5) integration from Seurat.

  • Harmony integration.

  • BBKNN integration.

  • scVI integration.

  • PCA.

Assumes that X contains logcounts.

Note

For CCA (v4/v5) and scVI, the corrected expression matrix (CCA v4), the CCA representation (CCA v5), and the latent space (scVI) are expected to be in .obsm. When re-clustering with Harmony or BBKNN, the pipeline is re-run on the selected clusters.

Parameters:
adata AnnData

Annotated data matrix.

cluster_key str

Metadata column in obs with cluster groups.

batch_key str

Metadata column in obs with batch groups.

use_clusters str | list | None (default: None)

Clusters in cluster_key to re-cluster. If several clusters are provided, the data is subset to all of the specified clusters and the re-clustering is performed on that subset.

hvg_batch bool (default: False)

If set to True, the highly variable genes that are shared across samples are used.

recluster_approach Literal['cca4', 'cca5', 'harmony', 'scanorama', 'pca', 'scvi']

Reclustering approach to use.

bbknn bool (default: False)

Use BBKNN to compute neighbors.

use_rep str (default: None)

Key in obsm that holds the representation. Required for the scVI, CCA and Scanorama approaches.

resolution float (default: 0.3)

Resolution for the Leiden clustering.

neighbors_batch int (default: 3)

BBKNN is used to compute the nearest-neighbor distance matrix and the neighborhood graph of observations as a batch-balanced KNN graph. A value of 3 is recommended for datasets with fewer than 100,000 cells and 25 for datasets with more than 100,000 cells. If there are not enough cells per batch, the default approach (sc.pp.neighbors) is used instead.

automatic_annot bool (default: False)

Perform semi-automatic annotation with CellTypist.

majority bool (default: True)

Whether to refine the predicted labels by running the majority voting classifier after over-clustering.

convert bool (default: True)

Convert the gene format of the model. If a human model is provided and this is set to True, genes in mouse format will be used, and vice versa.

model str (default: 'Healthy_Adult_Heart.pkl')

CellTypist model to use for the prediction.

get_subset bool (default: False)

If set to True, returns an AnnData object containing only use_clusters after re-clustering.

key_added str (default: 'annotation_recluster')

Metadata column name in obs in which to save the re-clustering information.

key_added_autoannot str (default: 'autoAnnot_recluster')

Metadata column name in obs in which to save the re-clustering information after automatic annotation.

random_state int (default: 0)

Seed for the random number generator.
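
The parameter descriptions above imply two checks that can be made before calling the function: choosing neighbors_batch from the number of cells, and confirming that use_rep is available when the approach needs it. The following is a minimal sketch of such checks, not part of dotools_py. The helper names suggest_neighbors_batch and check_use_rep are hypothetical, the mapping of "CCA" to both 'cca4' and 'cca5' is an assumption, and the behavior at exactly 100,000 cells is not specified by the documentation, so the value chosen there is arbitrary.

```python
# Hypothetical pre-flight helpers; these are NOT part of dotools_py.

# Approaches for which the `use_rep` description says a representation is required.
# Treating "CCA" as both 'cca4' and 'cca5' is an assumption.
NEEDS_USE_REP = {"scvi", "cca4", "cca5", "scanorama"}


def suggest_neighbors_batch(n_cells: int) -> int:
    """Return 3 below 100,000 cells and 25 otherwise.

    The documentation does not say what to use at exactly 100,000 cells;
    returning 25 there is an arbitrary choice made for this sketch.
    """
    return 3 if n_cells < 100_000 else 25


def check_use_rep(recluster_approach: str, use_rep, obsm_keys) -> None:
    """Raise ValueError if a required representation is missing."""
    if recluster_approach not in NEEDS_USE_REP:
        return  # e.g. 'harmony' or 'pca': no stored representation required
    if use_rep is None:
        raise ValueError(f"use_rep is required for {recluster_approach!r}")
    if use_rep not in obsm_keys:
        raise ValueError(f"{use_rep!r} not found in obsm")


print(suggest_neighbors_batch(700))         # small dataset
check_use_rep("harmony", None, ["X_pca"])   # passes silently
```

With a real AnnData object, n_cells would be adata.n_obs and obsm_keys would be adata.obsm.keys().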

Return type:

AnnData | None

Returns:

Returns None if get_subset is set to False; otherwise, returns the subsetted AnnData after re-clustering. Additionally, the following fields are set:

adata.obs['annotation_recluster' | key_added] pandas.Series (dtype category)

Array that stores the re-clustered groups, named as the original group_id followed by the new cluster id (e.g., for a monocyte cluster with 3 sub-clusters the new clusters are monocyte_0, monocyte_1, and monocyte_2).

adata.obs['autoAnnot_recluster' | key_added_autoannot] pandas.Series (dtype category)

Array that stores the re-clustered groups after re-running the automatic annotation pipeline.
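
The naming scheme described above (original group id, an underscore, then the new cluster id) can be illustrated with plain Python. This is a sketch of the convention only, not the library's implementation. The function compose_labels and the sample data are hypothetical, and keeping the original label for cells outside use_clusters is an assumption that the documentation does not confirm.

```python
def compose_labels(original, sub_ids, use_clusters):
    """Append the sub-cluster id to cells whose group is in `use_clusters`.

    Cells in other groups keep their original label (an assumption made
    for this sketch; the documentation does not state what they receive).
    """
    selected = set(use_clusters)
    return [
        f"{group}_{sub}" if group in selected else group
        for group, sub in zip(original, sub_ids)
    ]


# Made-up illustrative data: three monocytes and one cell from another cluster.
original = ["monocyte", "monocyte", "B_cells", "monocyte"]
sub_ids = ["0", "2", None, "1"]

labels = compose_labels(original, sub_ids, ["monocyte"])
print(labels)  # ['monocyte_0', 'monocyte_2', 'B_cells', 'monocyte_1']
```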

See also

dotools_py.tl.full_recluster()

Re-cluster all clusters automatically.

Example

>>> import dotools_py as do
>>> adata = do.dt.example_10x_processed()
>>> t_cells = do.tl.reclustering(adata, "annotation", "batch", "harmony", use_clusters="T_cells", get_subset=True)
>>> t_cells
AnnData object with n_obs × n_vars = 464 × 1851
obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
     'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo',
     'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'cell_type',
     'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
var: 'mean', 'std', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches',
     'highly_variable_intersection'
uns: 'annotation_colors', 'annotation_recluster_colors', 'batch_colors', 'hvg', 'leiden', 'leiden_colors', 'log1p',
     'neighbors', 'pca', 'umap'
obsm: 'X_CCA', 'X_pca', 'X_umap', 'X_pca_harmony'
varm: 'PCs'
layers: 'counts', 'logcounts'
obsp: 'connectivities', 'distances'
>>> adata
AnnData object with n_obs × n_vars = 700 × 1851
obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
     'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo',
     'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'cell_type', 'autoAnnot',
     'celltypist_conf_score', 'annotation', 'annotation_recluster'
var: 'mean', 'std', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches',
     'highly_variable_intersection'
uns: 'annotation_colors', 'annotation_recluster_colors', 'batch_colors', 'hvg', 'leiden', 'leiden_colors', 'log1p',
     'neighbors', 'pca', 'umap'
obsm: 'X_CCA', 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'counts', 'logcounts'
obsp: 'connectivities', 'distances'
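
After a run like the one above, it can be useful to count how many cells fall into each sub-cluster. The sketch below assumes the labels follow the original-group-plus-id pattern described under Returns. Because original group names may themselves contain underscores (for example T_cells), the label is split from the right. The function count_subclusters is hypothetical and the label list is made-up illustrative data, not output from the example dataset.

```python
from collections import Counter, defaultdict


def count_subclusters(labels):
    """Count cells per sub-cluster, grouped by the original cluster name.

    Labels that do not end in '_<integer>' are treated as not re-clustered
    and are skipped.
    """
    counts = defaultdict(Counter)
    for label in labels:
        # Split on the LAST underscore so 'T_cells_0' -> ('T_cells', '0').
        parent, sep, sub = label.rpartition("_")
        if sep and sub.isdigit():
            counts[parent][int(sub)] += 1
    return {parent: dict(c) for parent, c in counts.items()}


# Made-up labels for illustration only.
labels = ["T_cells_0", "T_cells_1", "T_cells_0", "B_cells"]
print(count_subclusters(labels))  # {'T_cells': {0: 2, 1: 1}}
```

With the objects from the example, the labels could be taken from t_cells.obs['annotation_recluster'] after converting the values to strings.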