dotools_py.pp.importer_py
- dotools_py.pp.importer_py(paths, ids, metadata=None, batch_key='batch', min_genes_in_cell=300, min_cells_with_genes=5, cut_mt=5, n_reads=10000, min_counts=None, max_counts=None, min_genes=None, max_genes=None, low_quantile=None, high_quantile=None, remove_doublets=True, doublet_tool='scDblFinder', normalisation_method='LogNormalisation', log_data=True, metrics_patterns=('mt-', ('rbs', 'rpl')), metrics_names=('mt', 'ribo'), random_state=0, technology='snrna')
Quality control analysis for scRNA / Spatial Transcriptomics.
The input is a list of paths to H5 files generated with CellRanger, CellBender, STARsolo, or SpaceRanger. A list of batch names, one per sample, must also be provided. Optionally, a dictionary with additional metadata can be passed. The order of batch names and metadata must match the order of the file paths.
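The ordering requirement can be checked before calling the importer. The following is a minimal sketch, not part of dotools_py: `check_inputs` is a hypothetical helper, and it assumes that `metadata` maps each key to a list with one value per sample, as in the example on this page.

```python
def check_inputs(paths, ids, metadata=None):
    """Hypothetical helper: pair each path with its batch name and metadata."""
    if len(paths) != len(ids):
        raise ValueError(f"got {len(paths)} paths but {len(ids)} ids")
    metadata = metadata or {}
    for key, values in metadata.items():
        if len(values) != len(paths):
            raise ValueError(
                f"metadata[{key!r}] has {len(values)} values, expected {len(paths)}"
            )
    # One record per sample, in the order of `paths`.
    return [
        {"path": p, "id": i, **{k: v[n] for k, v in metadata.items()}}
        for n, (p, i) in enumerate(zip(paths, ids))
    ]


samples = check_inputs(
    ["/path/sample1", "/path/sample2"],
    ["sample1", "sample2"],
    {"condition": ["WT", "KO"]},
)
```

Printing `samples` before the import makes it easy to spot a batch name or metadata value that has been assigned to the wrong file.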
For each sample, several quality and filtering steps are applied:
- Filter genes expressed in a low number of cells.
- Filter cells with a low number of genes.
- Filter cells with high mitochondrial content (recommended: 5% for scRNA, 3% for snRNA).
- Filter cells based on nUMI and features using two modes:
  - Absolute filtering: sets absolute values for min/max UMI and features.
  - Quantile filtering: filters the top/bottom quantiles.
- Remove doublets using scDblFinder, Scrublet, or DoubletDetection.
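The cell-level filters above can be illustrated on toy per-cell metrics. This is a sketch under stated assumptions, not the library's implementation: the helpers `in_range` and `percentile`, the toy values, and the percentile definition (linear interpolation via `statistics.quantiles`) are illustrative choices, and the quantile definition used by dotools_py may differ.

```python
from statistics import quantiles

# Toy per-cell QC metrics (illustrative values only).
total_counts = [200, 800, 1500, 3000, 50000]
n_genes = [150, 400, 900, 1500, 6000]
pct_mt = [2.0, 12.0, 1.0, 3.0, 4.0]


def in_range(value, low=None, high=None):
    """True if value lies within [low, high]; None means no bound."""
    return (low is None or value >= low) and (high is None or value <= high)


def percentile(values, q):
    """q-th percentile (1 to 99) using linear interpolation."""
    return quantiles(values, n=100, method="inclusive")[q - 1]


# Mitochondrial cutoff plus absolute thresholds on counts and genes
# (mirrors cut_mt=5, min_counts=500, min_genes=300).
absolute = [
    pct <= 5 and in_range(c, low=500) and in_range(g, low=300)
    for c, g, pct in zip(total_counts, n_genes, pct_mt)
]

# Quantile mode: drop cells above the 95th percentile of total counts.
upper = percentile(total_counts, 95)
quantile = [in_range(c, high=upper) for c in total_counts]

# Indices of cells that pass both sets of filters.
kept = [i for i, (a, q) in enumerate(zip(absolute, quantile)) if a and q]
```

Here the first cell fails the count threshold, the second fails the mitochondrial cutoff, and the last is removed as an upper-quantile outlier, so only the two middle cells remain.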
An Excel sheet summarizing how many cells/genes were removed at each step will be generated, along with violin plots showing the distribution of total_counts, n_genes_by_counts, and pct_mt_content before and after QC. These outputs will be saved in the folder containing the H5 files.
After QC, the data will be log-normalized and scaled. Highly variable genes and PCA will also be computed.
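To make the log-normalization step concrete, here is a minimal sketch for a single cell. It assumes that 'LogNormalisation' follows the common recipe of scaling each cell's counts to a target sum (the `n_reads` parameter) and then applying log(1 + x) when `log_data` is True. The function `log_normalise` is hypothetical and is not the library's code.

```python
import math


def log_normalise(counts, target_sum=10_000, log_data=True):
    """Scale one cell's counts to `target_sum`, then optionally apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros rather than dividing by zero.
        return [0.0 for _ in counts]
    scaled = [c * target_sum / total for c in counts]
    return [math.log1p(x) for x in scaled] if log_data else scaled


cell = [0, 5, 15, 30]  # raw counts for four genes in one cell
scaled = log_normalise(cell, log_data=False)
logged = log_normalise(cell)
```

After scaling, the values of every cell sum to the same target, so cells with different sequencing depth become comparable before the log transform.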
Note
Depending on the type of technology, some steps will be omitted or adapted.
- Parameters:
- paths
list list with the path to the H5 files.
- ids
list list with the batch name for each sample.
- metadata
dict|None(default:None) dictionary with metadata information.
- batch_key
str(default:'batch') key in .obs for the batch information.
- remove_doublets
bool(default:True) if set to True, neotypic doublets will be removed.
- doublet_tool
Literal['scDblFinder','Scrublet','DoubletDetection'] (default:'scDblFinder') doublet detection tool to use. Available options: scDblFinder, Scrublet, and DoubletDetection.
- min_genes_in_cell
int(default:300) minimum number of genes per cell.
- min_cells_with_genes
int(default:5) minimum number of cells expressing a gene.
- n_reads
int(default:10000) target sum after normalization per cell.
- cut_mt
int|None(default:5) maximum percentage of mitochondrial content per cell.
- min_counts
int|None(default:None) minimum number of counts per cell.
- max_counts
int|None(default:None) maximum number of counts per cell.
- min_genes
int|None(default:None) minimum number of genes per cell.
- max_genes
int|None(default:None) maximum number of genes per cell.
- low_quantile
int|None(default:None) low quantile to filter cells based on counts.
- high_quantile
int|None(default:None) upper quantile to filter cells based on counts.
- normalisation_method
Literal['LogNormalisation','PearsonResiduals'] (default:'LogNormalisation') Type of normalization method.
- log_data
bool(default:True) Whether to log-transform the data after normalization.
- metrics_patterns
tuple(default:('mt-', ('rbs', 'rpl'))) Patterns to use to annotate features. Use mt- for mitochondrial, rps and rpl for ribosomal, and ^hb*- for hemoglobin. Should be written in lowercase.
- metrics_names
list(default:('mt', 'ribo')) Names for the patterns. Use “mt” for mitochondrial, “ribo” for ribosomal, and “hb” for hemoglobin.
- technology
Literal['snrna','scrna','visium','xenium'] (default:'snrna') Type of the input dataset.
- random_state
int(default:0) seed for random number generator.
- Returns:
Returns an annotated data matrix of shape n_obs x n_vars with all the samples concatenated.
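The `metrics_patterns` and `metrics_names` parameters pair up by position: each name labels the genes matched by the pattern, or group of patterns, in the same slot. The sketch below illustrates this with simple prefix matching on lowercased gene names. That matching rule is an assumption (the library may use regular expressions instead), `annotate_features` is a hypothetical helper, and the ribosomal prefixes are the `rps`/`rpl` pair named in the parameter description.

```python
def annotate_features(
    gene_names,
    patterns=("mt-", ("rps", "rpl")),
    names=("mt", "ribo"),
):
    """Hypothetical helper: flag genes whose lowercased name starts with a pattern."""
    flags = {}
    for name, pattern in zip(names, patterns):
        # str.startswith accepts a tuple, so a single pattern is wrapped in one.
        prefixes = (pattern,) if isinstance(pattern, str) else tuple(pattern)
        flags[name] = [gene.lower().startswith(prefixes) for gene in gene_names]
    return flags


genes = ["mt-Co1", "Rps6", "Rpl13", "Actb"]
flags = annotate_features(genes)
```

Lowercasing the gene names before matching is why the patterns themselves should be written in lowercase.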
Example
>>> import dotools_py as do
>>> paths = ["/path/sample1", "/path/sample2"]
>>> batchname = ["sample1", "sample2"]
>>> metadata = {
...     "condition": ["WT", "KO"],
...     "age": ["3m", "3m"],
... }
>>> adata = do.pp.importer_py(
...     paths=paths,
...     ids=batchname,
...     metadata=metadata,
...     batch_key="batch",
...     remove_doublets=True,
...     min_genes_in_cell=300,
...     min_cells_with_genes=5,
...     n_reads=10_000,
...     cut_mt=5,
...     high_quantile=95,
...     min_counts=500,
... )