dotools_py.pp.importer_py


dotools_py.pp.importer_py(paths, ids, metadata=None, batch_key='batch', remove_doublets=True, doublet_tool='scDblFinder', min_genes_in_cell=300, min_cells_with_genes=5, cut_mt=5, n_reads=10000, min_counts=None, max_counts=None, min_genes=None, max_genes=None, low_quantile=None, high_quantile=None)

Quality control analysis for sc/snRNA-seq data.

The input is a list with paths to H5 files generated with CellRanger, Cellbender, or STARsolo. A list of batch names for each sample must also be provided. Optionally, a dictionary with additional metadata can be passed. The order of batch names and metadata must match the order of the file paths.
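
Because samples are matched to batch names and metadata by position, it can help to check the alignment before calling the importer. The sketch below uses a hypothetical helper, `check_inputs`, which is not part of dotools_py. It assumes each metadata value is a list with one entry per sample, as in the example at the end of this page.

```python
def check_inputs(paths, ids, metadata=None):
    """Hypothetical helper: verify that ids and metadata align with paths."""
    metadata = metadata or {}
    if len(paths) != len(ids):
        raise ValueError(f"{len(paths)} paths but {len(ids)} batch names")
    for key, values in metadata.items():
        if len(values) != len(paths):
            raise ValueError(
                f"metadata[{key!r}] has {len(values)} entries, expected {len(paths)}"
            )
    # Build one record per sample so the positional pairing is explicit.
    records = []
    for i, (path, batch) in enumerate(zip(paths, ids)):
        record = {"path": path, "batch": batch}
        for key, values in metadata.items():
            record[key] = values[i]
        records.append(record)
    return records


paths = ["/path/sample1", "/path/sample2"]
ids = ["sample1", "sample2"]
metadata = {"condition": ["WT", "KO"], "age": ["3m", "3m"]}
for record in check_inputs(paths, ids, metadata):
    print(record)
```

Printing one record per sample makes a swapped or missing entry easy to spot before any files are read.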

For each sample, several quality and filtering steps are applied:

  • Filter genes expressed in a low number of cells.

  • Filter cells with a low number of genes.

  • Filter cells with high mitochondrial content (recommended: 5% for scRNA, 3% for snRNA).

  • Filter cells based on the number of UMIs (nUMI) and the number of detected genes (features) using two modes:
    1. Absolute filtering: sets absolute minimum/maximum values for UMIs and features.

    2. Quantile filtering: removes cells in the top/bottom quantiles.

  • Remove doublets using scDblFinder, Scrublet, or DoubletDetection.
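
The gene, cell, mitochondrial, and absolute-count filters above can be illustrated on a toy count matrix. This is a minimal sketch and not the code used by importer_py: `qc_filter` is a hypothetical function, the order of the steps, the "mt-" gene prefix, and the inclusive thresholds are all assumptions, and doublet removal is not shown.

```python
# Toy raw count matrix: rows are cells, columns are genes.
genes = ["mt-Co1", "Actb", "Gapdh", "Myh6", "Nppa"]
counts = [
    [2, 50, 40, 8, 0],     # cell 0: passes every filter
    [30, 40, 30, 0, 0],    # cell 1: 30% mitochondrial counts
    [0, 3, 0, 0, 0],       # cell 2: only one gene detected
    [1, 60, 30, 9, 7],     # cell 3: passes every filter
    [0, 400, 350, 0, 0],   # cell 4: total counts above max_counts
]


def qc_filter(counts, genes, min_genes_in_cell=300, min_cells_with_genes=5,
              cut_mt=5, min_counts=None, max_counts=None):
    """Toy QC filter; returns (kept cell indices, kept gene names)."""
    # Keep genes detected in at least `min_cells_with_genes` cells.
    kept = [j for j in range(len(genes))
            if sum(row[j] > 0 for row in counts) >= min_cells_with_genes]
    # Assumption: mitochondrial genes carry an "mt-" / "MT-" prefix.
    mito = [j for j in kept if genes[j].lower().startswith("mt-")]

    kept_cells = []
    for i, row in enumerate(counts):
        n_genes = sum(row[j] > 0 for j in kept)
        total = sum(row[j] for j in kept)
        pct_mt = 100 * sum(row[j] for j in mito) / total if total else 0.0
        if n_genes < min_genes_in_cell:
            continue  # too few detected genes
        if pct_mt > cut_mt:
            continue  # too much mitochondrial content
        if min_counts is not None and total < min_counts:
            continue  # too few counts
        if max_counts is not None and total > max_counts:
            continue  # too many counts
        kept_cells.append(i)
    return kept_cells, [genes[j] for j in kept]


cells, kept_genes = qc_filter(
    counts, genes,
    min_genes_in_cell=2, min_cells_with_genes=2,
    cut_mt=5, min_counts=50, max_counts=500,
)
print(cells)       # [0, 3]
print(kept_genes)  # ['mt-Co1', 'Actb', 'Gapdh', 'Myh6']
```

The thresholds here are scaled down to suit the toy matrix; the defaults of the real function (300 genes per cell, 5 cells per gene) are meant for full datasets.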

An Excel sheet summarizing how many cells/genes were removed at each step will be generated, along with violin plots showing the distribution of total_counts, n_genes_by_counts, and pct_mt_content before and after QC. These outputs will be saved in the folder containing the H5 files.

After QC, the data will be log-normalized and scaled. Highly variable genes and PCA will also be computed.
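
As a rough illustration of log-normalization with a per-cell target sum (the `n_reads` parameter), the sketch below rescales one cell's counts so they sum to `n_reads` and then applies log(1 + x). The `log_normalize` function is hypothetical, and the use of the natural logarithm with a pseudocount of 1 is an assumption based on common practice, not confirmed by this page. Scaling, highly variable gene selection, and PCA are not shown.

```python
import math


def log_normalize(cell_counts, n_reads=10_000):
    """Rescale one cell's counts to sum to n_reads, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell stays all-zero instead of dividing by zero.
        return [0.0] * len(cell_counts)
    return [math.log1p(c * n_reads / total) for c in cell_counts]


cell = [0, 5, 15, 80]
normalized = log_normalize(cell)
print([round(v, 3) for v in normalized])

# Undoing the log recovers values that sum to the target.
print(round(sum(math.expm1(v) for v in normalized)))  # 10000
```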

Parameters:
paths list

list with the path to the H5 files.

ids list

list with the batch name for each sample.

metadata dict | None (default: None)

dictionary with metadata information.

batch_key str (default: 'batch')

key in .obs for the batch information.

remove_doublets bool (default: True)

if set to True, neotypic doublets will be removed.

doublet_tool Literal['scDblFinder', 'Scrublet', 'DoubletDetection'] (default: 'scDblFinder')

doublet detection tool to use. Available options: scDblFinder, Scrublet, and DoubletDetection.

min_genes_in_cell int (default: 300)

minimum number of genes per cell.

min_cells_with_genes int (default: 5)

minimum number of cells expressing a gene.

n_reads int (default: 10000)

target sum after normalisation per cell.

cut_mt int (default: 5)

maximum percentage of mitochondrial genes per cell.

min_counts int | None (default: None)

minimum number of counts per cell.

max_counts int | None (default: None)

maximum number of counts per cell.

min_genes int | None (default: None)

minimum number of genes per cell.

max_genes int | None (default: None)

maximum number of genes per cell.

low_quantile int | None (default: None)

lower quantile to filter cells based on counts.

high_quantile int | None (default: None)

upper quantile to filter cells based on counts.
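
Quantile filtering on counts can be sketched as follows. Because the quantile parameters are typed as integers and the example on this page passes `high_quantile=95`, the sketch assumes they are percentiles on a 0-100 scale. The `percentile` and `quantile_filter` functions are hypothetical, and the linear interpolation and inclusive bounds are assumptions, so the exact cells kept by importer_py may differ.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def quantile_filter(total_counts, low_quantile=None, high_quantile=None):
    """Return indices of cells whose counts lie within the quantile bounds."""
    lo = float("-inf") if low_quantile is None else percentile(total_counts, low_quantile)
    hi = float("inf") if high_quantile is None else percentile(total_counts, high_quantile)
    return [i for i, c in enumerate(total_counts) if lo <= c <= hi]


# Total counts per cell; the last cell is an outlier with very high counts.
total_counts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 20000]
kept = quantile_filter(total_counts, low_quantile=10, high_quantile=90)
print(kept)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Unlike absolute thresholds, the cut-offs here adapt to each sample's own count distribution, which is why the outlier and the lowest cell are dropped without specifying any fixed value.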

Return type:

AnnData

Returns:

Annotated data matrix of shape n_obs × n_vars with all samples concatenated.

Example

>>> import dotools_py as do
>>> paths = ["/path/sample1", "/path/sample2"]
>>> batchname = ["sample1", "sample2"]
>>> metadata = {
...     "condition": ["WT", "KO"],
...     "age": ["3m", "3m"],
... }
>>> adata = do.pp.importer_py(
...     paths=paths,
...     ids=batchname,
...     metadata=metadata,
...     batch_key="batch",
...     remove_doublets=True,
...     min_genes_in_cell=300,
...     min_cells_with_genes=5,
...     n_reads=10_000,
...     cut_mt=5,
...     high_quantile=95,
...     min_counts=500,
... )