grassp.pp.annotate_uniprot_cc

annotate_uniprot_cc(data, protein_id_column=None, include_multiloc=True)

Annotate proteins with UniProt subcellular location annotations.

Queries the UniProt REST API to retrieve the subcellular location comments (CC) for each protein in the dataset. Uses the UniProt controlled vocabulary to map location IDs to standardized terms and to determine their hierarchical relationships. The fine location is the specific subcellular location term, while the coarse location is the root of the hierarchy, found by following the vocabulary's HP ("part-of") relationships upward until no parent remains.
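The fine-to-coarse mapping described above can be sketched as a walk up a part-of map. This is a minimal illustration under stated assumptions, not grassp's actual implementation: the PART_OF dictionary and the coarse_location helper are hypothetical names, and the entries are illustrative examples rather than content taken from the UniProt vocabulary.

```python
# Hypothetical child -> parent map standing in for the "part-of" (HP)
# relationships. The entries are illustrative only.
PART_OF = {
    "Nucleus inner membrane": "Nucleus membrane",
    "Nucleus membrane": "Nucleus",
    "Nucleolus": "Nucleus",
}


def coarse_location(term, part_of):
    """Follow part-of links from a fine term until a root term is reached."""
    seen = set()
    # Stop when the term has no parent; the `seen` set guards against cycles.
    while term in part_of and term not in seen:
        seen.add(term)
        term = part_of[term]
    return term


fine = "Nucleus inner membrane"
coarse = coarse_location(fine, PART_OF)
print(fine, "->", coarse)
```

A term with no recorded parent is treated as its own root, so in this sketch the fine and coarse locations coincide for top-level terms.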

This function adds annotation columns to .obs and modifies the AnnData object in-place.

Parameters:
data AnnData

AnnData object with proteins in .obs (proteins_as_obs=True).

protein_id_column str | None (default: None)

Column in .obs containing UniProt IDs. If None, uses obs_names.

include_multiloc bool (default: True)

If True, include columns with all locations (comma-separated) for multi-localizing proteins. If False, only include primary (first) location columns.

Return type:

None

Returns:

None. Modifies data.obs in place by adding these columns:

  • uniprot_cc_primary_coarse: First location at top hierarchy level

  • uniprot_cc_primary_fine: First location at most specific level

  • uniprot_cc_all_coarse: All coarse locations, comma-separated (only if include_multiloc=True)

  • uniprot_cc_all_fine: All fine locations, comma-separated (only if include_multiloc=True)

Proteins whose query fails or returns no annotation have NaN in these columns.
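Downstream code usually needs to handle both the NaN entries and the comma-separated multi-location strings. The sketch below uses a small hand-built DataFrame in place of a real adata.obs. The protein IDs and location values are made up for illustration, and the exact separator formatting (with or without a space after the comma) is an assumption, so each piece is stripped after splitting.

```python
import numpy as np
import pandas as pd

# Stand-in for adata.obs after annotation; the values are invented for illustration.
obs = pd.DataFrame(
    {
        "uniprot_cc_primary_coarse": ["Nucleus", "Cytoplasm", np.nan],
        "uniprot_cc_all_coarse": ["Nucleus,Cytoplasm", "Cytoplasm", np.nan],
    },
    index=["P1", "P2", "P3"],
)

# Proteins without an annotation carry NaN, so count them before filtering.
annotated = obs["uniprot_cc_primary_coarse"].notna()
n_missing = int((~annotated).sum())

# Split the comma-separated strings into lists, skipping NaN rows and
# stripping whitespace in case the separator is ", " rather than ",".
all_coarse = (
    obs["uniprot_cc_all_coarse"]
    .dropna()
    .str.split(",")
    .apply(lambda parts: [p.strip() for p in parts])
)

# Proteins assigned to more than one coarse location.
multi = all_coarse.apply(len) > 1
multi_ids = list(multi.index[multi])

print(n_missing, multi_ids)
```

Dropping NaN rows before splitting keeps the list-valued Series free of missing entries, at the cost of a shorter index than the original obs table.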

Examples

>>> import grassp as gr
>>> adata = gr.datasets.hein_2024()
>>> gr.pp.annotate_uniprot_cc(adata)
>>> adata.obs[['uniprot_cc_primary_coarse', 'uniprot_cc_primary_fine']].head()

Filter for nuclear proteins:

>>> nuclear = adata[adata.obs['uniprot_cc_primary_coarse'] == 'Nucleus']