dotools_py.pp.importer_py

dotools_py.pp.importer_py(paths, ids, metadata=None, batch_key='batch', min_genes_in_cell=300, min_cells_with_genes=5, cut_mt=5, n_reads=10000, min_counts=None, max_counts=None, min_genes=None, max_genes=None, low_quantile=None, high_quantile=None, remove_doublets=True, doublet_tool='scDblFinder', normalisation_method='LogNormalisation', log_data=True, metrics_patterns=('mt-', ('rbs', 'rpl')), metrics_names=('mt', 'ribo'), random_state=0, technology='snrna')

Quality control analysis for single-cell / single-nucleus RNA-seq and spatial transcriptomics data.

The input is a list of paths to H5 files generated with CellRanger, CellBender, STARsolo, or SpaceRanger. A list of batch names, one per sample, must also be provided. Optionally, a dictionary with additional metadata can be passed. The order of batch names and metadata must match the order of the file paths.
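The ordering requirement can be checked before calling the importer. The following is a minimal sketch and not part of dotools_py: `check_inputs` is a hypothetical helper, and it assumes each metadata value is a list with one entry per sample, as in the example at the end of this page.

```python
def check_inputs(paths, ids, metadata=None):
    """Hypothetical helper (not part of dotools_py).

    Verifies that ids and metadata line up one-to-one with paths and
    returns one record per sample so the pairing can be inspected.
    """
    metadata = metadata or {}
    if len(paths) != len(ids):
        raise ValueError(f"{len(paths)} paths but {len(ids)} ids")
    for key, values in metadata.items():
        # Assumption: every metadata entry is a list ordered like paths.
        if len(values) != len(paths):
            raise ValueError(
                f"metadata[{key!r}] has {len(values)} values, expected {len(paths)}"
            )
    return [
        {"path": path, "id": batch, **{k: v[n] for k, v in metadata.items()}}
        for n, (path, batch) in enumerate(zip(paths, ids))
    ]


records = check_inputs(
    ["/path/sample1", "/path/sample2"],
    ["sample1", "sample2"],
    {"condition": ["WT", "KO"]},
)
for record in records:
    print(record)
```

Printing the records makes a shifted or swapped entry visible before any file is read.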

For each sample, several quality and filtering steps are applied:

  • Filter genes expressed in a low number of cells.

  • Filter cells with a low number of genes.

  • Filter cells with high mitochondrial content (recommended: 5% for scRNA, 3% for snRNA).

  • Filter cells based on the number of UMIs (nUMI) and the number of detected features using two modes:
    1. Absolute filtering: Sets absolute minimum/maximum values for UMIs and features.

    2. Quantile filtering: Removes cells in the lower/upper quantiles.

  • Remove doublets using scDblFinder, Scrublet, or DoubletDetection.
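The two filtering modes above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, thresholds are treated as inclusive, quantiles are given in percent (as `high_quantile=95` in the example suggests), and the quantile is computed with the nearest-rank definition.

```python
import math


def absolute_mask(values, minimum=None, maximum=None):
    """Keep entries inside [minimum, maximum]; None disables a bound."""
    return [
        (minimum is None or v >= minimum) and (maximum is None or v <= maximum)
        for v in values
    ]


def nearest_rank(values, q):
    """Nearest-rank percentile for q in (0, 100] (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def quantile_mask(values, low_quantile=None, high_quantile=None):
    """Turn quantiles into absolute cut-offs, then filter as above."""
    lo = nearest_rank(values, low_quantile) if low_quantile is not None else None
    hi = nearest_rank(values, high_quantile) if high_quantile is not None else None
    return absolute_mask(values, lo, hi)


# Toy per-cell UMI counts: one near-empty droplet and one extreme outlier.
counts = [120, 800, 950, 1100, 1300, 1500, 1700, 2100, 2600, 40000]
abs_keep = absolute_mask(counts, minimum=500)      # drops the 120-count cell
q_keep = quantile_mask(counts, high_quantile=90)   # drops the 40000-count cell
```

Absolute cut-offs are easy to reason about but depend on sequencing depth, whereas quantile cut-offs adapt to each sample's distribution.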

An Excel sheet summarizing how many cells/genes were removed at each step will be generated, along with violin plots showing the distribution of total_counts, n_genes_by_counts, and pct_mt_content before and after QC. These outputs will be saved in the folder containing the H5 files.

After QC, the data will be log-normalized and scaled. Highly variable genes and PCA will also be computed.
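As a rough illustration of the normalization step, the sketch below scales one cell's counts to a target sum and then applies a natural log(1 + x) transform. It assumes that `n_reads` is the per-cell target sum and that `log_data` toggles the log transform; `log_normalise` is a hypothetical name, and the library's actual implementation may differ.

```python
import math


def log_normalise(cell_counts, target_sum=10_000, log_data=True):
    """Scale one cell's counts to target_sum, optionally applying log1p."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell stays at zero instead of dividing by zero.
        return [0.0 for _ in cell_counts]
    scaled = [c * target_sum / total for c in cell_counts]
    return [math.log1p(x) for x in scaled] if log_data else scaled


cell = [0, 3, 1, 6]                            # raw counts for four genes
scaled = log_normalise(cell, log_data=False)   # sums to the target of 10,000
logged = log_normalise(cell)                   # log1p of the scaled values
```

Scaling to a common total makes cells with different sequencing depths comparable before the log transform compresses the dynamic range.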

Note

Depending on the technology, some steps will be omitted or adapted.

Parameters:
paths list

list with the paths to the H5 files.

ids list

list with the batch name for each sample.

metadata dict | None (default: None)

dictionary with metadata information.

batch_key str (default: 'batch')

key in .obs for the batch information.

remove_doublets bool (default: True)

if set to True, neotypic doublets will be removed.

doublet_tool Literal['scDblFinder', 'Scrublet', 'DoubletDetection'] (default: 'scDblFinder')

doublet detection tool to use. Available options: scDblFinder, Scrublet, and DoubletDetection.

min_genes_in_cell int (default: 300)

minimum number of genes per cell.

min_cells_with_genes int (default: 5)

minimum number of cells expressing a gene.

n_reads int (default: 10000)

target sum after normalization per cell.

cut_mt int | None (default: 5)

maximum percentage of mitochondrial genes per cell.

min_counts int | None (default: None)

minimum number of counts per cell.

max_counts int | None (default: None)

maximum number of counts per cell.

min_genes int | None (default: None)

minimum number of genes per cell.

max_genes int | None (default: None)

maximum number of genes per cell.

low_quantile int | None (default: None)

low quantile to filter cells based on counts.

high_quantile int | None (default: None)

upper quantile to filter cells based on counts.

normalisation_method Literal['LogNormalisation', 'PearsonResiduals'] (default: 'LogNormalisation')

Type of normalization method.

log_data bool (default: True)

Whether to log-transform the data after normalization.

metrics_patterns tuple (default: ('mt-', ('rbs', 'rpl')))

Patterns to use to annotate features. Use mt- for mitochondrial, rps and rpl for ribosomal, and ^hb*- for hemoglobin. Should be written in lowercase.

metrics_names list (default: ('mt', 'ribo'))

Names for the patterns. Use “mt” for mitochondrial, “ribo” for ribosomal, and “hb” for hemoglobin.

technology Literal['snrna', 'scrna', 'visium', 'xenium'] (default: 'snrna')

Type of the input dataset.

random_state int (default: 0)

seed for random number generator.
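To illustrate how `metrics_patterns`, `metrics_names`, and `cut_mt` could relate, the sketch below pairs each name with its pattern(s), flags genes by lowercase prefix, and computes the mitochondrial percentage for one cell. Everything here is an assumption for illustration: the helper names are hypothetical, patterns are treated as plain prefixes rather than regular expressions, the ribosomal prefixes follow the parameter description (rps, rpl), and the threshold is taken as inclusive.

```python
def annotate_genes(gene_names, patterns=("mt-", ("rps", "rpl")), names=("mt", "ribo")):
    """Return {name: [bool per gene]} by pairing names with patterns in order."""
    flags = {}
    for name, pattern in zip(names, patterns):
        # A pattern may be one prefix or a group of prefixes.
        prefixes = (pattern,) if isinstance(pattern, str) else tuple(pattern)
        flags[name] = [gene.lower().startswith(prefixes) for gene in gene_names]
    return flags


def pct_counts(cell_counts, flag):
    """Percentage of a cell's counts that fall on the flagged genes."""
    total = sum(cell_counts)
    flagged = sum(c for c, is_hit in zip(cell_counts, flag) if is_hit)
    return 100 * flagged / total if total else 0.0


genes = ["mt-Nd1", "Rpl13", "Rps6", "Actb"]
flags = annotate_genes(genes)
pct_mt = pct_counts([4, 10, 6, 80], flags["mt"])   # 4 of 100 counts
passes = pct_mt <= 5                               # compared against cut_mt=5
```

Lowercasing the gene names before matching is why the patterns are expected in lowercase.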

Returns:

Returns an annotated data matrix of shape n_obs × n_vars with all the samples concatenated.

Example

>>> import dotools_py as do
>>> paths = ["/path/sample1", "/path/sample2"]
>>> batchname = ["sample1", "sample2"]
>>> metadata = {
...     "condition": ["WT", "KO"],
...     "age": ["3m", "3m"],
... }
>>> adata = do.pp.importer_py(
...     paths=paths,
...     ids=batchname,
...     metadata=metadata,
...     batch_key="batch",
...     remove_doublets=True,
...     min_genes_in_cell=300,
...     min_cells_with_genes=5,
...     n_reads=10_000,
...     cut_mt=5,
...     high_quantile=95,
...     min_counts=500,
... )