dotools_py.utility.add_gene_metadata

dotools_py.utility.add_gene_metadata(data, gene_key, species='mouse', add_gene_id=False)

Add gene metadata to AnnData or DataFrame.

Add gene metadata obtained from the GTF or the UniProt database. This information includes the gene biotype (e.g., protein-coding, lncRNA), the ENSEMBL gene ID, and the subcellular location.

Parameters:
data AnnData | DataFrame

Annotated data matrix or pandas DataFrame, for example one holding the results of a differential gene expression analysis.

gene_key str

Name of the column containing the gene names. If an AnnData is provided, this is the name of the .var column with the gene names. If the gene names are in var_names, specify "var_names".

species Literal['mouse', 'human'] (default: 'mouse')

Species of the input data.

add_gene_id bool (default: False)

Whether to add the gene ID (ENSEMBL ID).

Return type:

AnnData | DataFrame

Returns:

Returns a DataFrame or AnnData object, matching the input. Three new columns will be set: biotype, locations and gene_id. For an AnnData, the columns are added to .var.
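To make the shape of the result concrete, the sketch below attaches the three columns to a DataFrame with a plain pandas left join. It is only an illustration under stated assumptions, not the library's implementation: the `lookup` table and the `annotate` helper are hypothetical, the lookup values are copied from the documented example output, and the way the real function handles genes that cannot be matched may differ from the NaN shown here.

```python
import pandas as pd

# Hypothetical lookup table standing in for the GTF/UniProt-derived metadata.
# The values are copied from the documented example output.
lookup = pd.DataFrame(
    {
        "gene_name": ["Acta2", "Ptprc", "Vcam1"],
        "biotype": ["protein_coding", "protein_coding", "protein_coding"],
        "locations": ["cytoplasm", "membrane", "secreted,membrane"],
        "gene_id": [
            "ENSMUSG00000035783",
            "ENSMUSG00000026395",
            "ENSMUSG00000027962",
        ],
    }
)


def annotate(df: pd.DataFrame, gene_key: str) -> pd.DataFrame:
    """Hypothetical helper: left-join the lookup onto df by gene name."""
    # Rename the lookup key so both frames share the join column.
    table = lookup.rename(columns={"gene_name": gene_key})
    # A left join keeps every input row, in the original order.
    return df.merge(table, how="left", on=gene_key)


df = pd.DataFrame({"genes": ["Acta2", "NotAGene", "Vcam1"]})
annotated = annotate(df, "genes")
print(annotated)
```

In this sketch the unmatched row keeps its place and receives missing values in the three new columns, so the output always has as many rows as the input.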

Examples

>>> import dotools_py as do
>>> import pandas as pd
>>> from dotools_py.utility import add_gene_metadata
>>> # AnnData Input
>>> adata = do.dt.example_10x_processed()
>>> adata = add_gene_metadata(adata, "var_names", "human")
>>> adata.var[["biotype", "gene_id", "locations"]].head(5)
                   biotype          gene_id                locations
ATP2A1-AS1          lncRNA  ENSG00000260442  Unreview status Uniprot
STK17A      protein_coding  ENSG00000164543                  nucleus
C19orf18    protein_coding  ENSG00000177025                 membrane
TPP2        protein_coding  ENSG00000134900        nucleus,cytoplasm
MFSD1       protein_coding  ENSG00000118855       membrane,cytoplasm
>>>
>>> # Dataframe Input
>>> df = pd.DataFrame(["Acta2", "Tagln", "Ptprc", "Vcam1"], columns=["genes"])
>>> df = add_gene_metadata(df, "genes")
>>> df.head()
   genes         biotype          locations             gene_id
0  Acta2  protein_coding          cytoplasm  ENSMUSG00000035783
1  Tagln  protein_coding          cytoplasm  ENSMUSG00000032085
2  Ptprc  protein_coding           membrane  ENSMUSG00000026395
3  Vcam1  protein_coding  secreted,membrane  ENSMUSG00000027962
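Once the columns are present, they can be used for ordinary pandas filtering. The following sketch rebuilds the DataFrame output shown above by hand (so it runs without dotools_py) and selects protein-coding genes annotated with a membrane location. It assumes that multiple locations are joined by commas without spaces, as they appear in the example output.

```python
import pandas as pd

# Rows copied from the documented DataFrame example output.
df = pd.DataFrame(
    {
        "genes": ["Acta2", "Tagln", "Ptprc", "Vcam1"],
        "biotype": ["protein_coding"] * 4,
        "locations": ["cytoplasm", "cytoplasm", "membrane", "secreted,membrane"],
        "gene_id": [
            "ENSMUSG00000035783",
            "ENSMUSG00000032085",
            "ENSMUSG00000026395",
            "ENSMUSG00000027962",
        ],
    }
)

# Split the comma-separated locations into lists so membership tests are exact
# (a substring search could match unrelated location names).
location_lists = df["locations"].str.split(",")
is_membrane = location_lists.apply(lambda locs: "membrane" in locs)
is_coding = df["biotype"] == "protein_coding"

membrane_genes = df.loc[is_membrane & is_coding, "genes"].tolist()
print(membrane_genes)  # ['Ptprc', 'Vcam1']
```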