grassp.pp.annotate_uniprot_cc
- annotate_uniprot_cc(data, protein_id_column=None, include_multiloc=True)
Annotate proteins with UniProt subcellular location annotations.
Queries the UniProt REST API to retrieve subcellular localization (CC) data for each protein in the dataset. Uses the UniProt controlled vocabulary to map location IDs to standardized terms and determine hierarchical relationships. The fine location is the specific subcellular location term, while the coarse location is the root of the hierarchy (determined by following HP part-of relationships in the vocabulary).
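To make the fine-versus-coarse distinction concrete, the sketch below walks a toy part-of mapping up to its root. The PART_OF dictionary and the coarse_location helper are hypothetical illustrations, not grassp internals. The real vocabulary is far larger, and a term there may have more than one parent, which this single-parent sketch ignores.

```python
# Toy part-of mapping (child -> parent). Hypothetical, simplified excerpt;
# the actual UniProt vocabulary is not reproduced here.
PART_OF = {
    "Nucleus inner membrane": "Nucleus membrane",
    "Nucleus membrane": "Nucleus",
    "Mitochondrion matrix": "Mitochondrion",
}


def coarse_location(term, part_of=PART_OF):
    """Follow part-of links from `term` until a term with no parent is reached."""
    seen = set()
    while term in part_of:
        if term in seen:
            # Guard against an accidental cycle in the mapping.
            raise ValueError(f"cycle in hierarchy at {term!r}")
        seen.add(term)
        term = part_of[term]
    return term


fine = "Nucleus inner membrane"
coarse = coarse_location(fine)  # walks two part-of links up to the root
```

A term that has no parent in the mapping is its own root, so its fine and coarse values coincide.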
This function adds annotation columns to .obs and modifies the AnnData object in-place.
- Parameters:
  - data: AnnData
    AnnData object with proteins in .obs (proteins_as_obs=True).
  - protein_id_column: str | None (default: None)
    Column in .obs containing UniProt IDs. If None, uses obs_names.
  - include_multiloc: bool (default: True)
    If True, include columns with all locations (comma-separated) for multi-localizing proteins. If False, only include primary (first) location columns.
- Return type:
  None
- Returns:
  Modifies data.obs in-place by adding columns:
  - uniprot_cc_primary_coarse: First location at top hierarchy level
  - uniprot_cc_primary_fine: First location at most specific level
  - uniprot_cc_all_coarse: All coarse locations, comma-separated (only if include_multiloc=True)
  - uniprot_cc_all_fine: All fine locations, comma-separated (only if include_multiloc=True)
Proteins with missing or failed queries will have NaN values.
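The relationship between the four columns can be sketched for a single protein as follows. This is an illustration under stated assumptions, not the library's implementation: the location_columns helper is hypothetical, the separator is assumed to be a bare comma, and repeated coarse terms are not de-duplicated.

```python
import math


def location_columns(locations, include_multiloc=True):
    """Build the annotation values for one protein.

    `locations` is an ordered list of (fine, coarse) pairs. An empty list
    stands in for a missing or failed query (hypothetical convention).
    """
    if locations:
        primary_fine, primary_coarse = locations[0]
        all_fine = ",".join(fine for fine, _ in locations)
        all_coarse = ",".join(coarse for _, coarse in locations)
    else:
        # No annotation available: every column is NaN.
        primary_fine = primary_coarse = all_fine = all_coarse = math.nan

    row = {
        "uniprot_cc_primary_coarse": primary_coarse,
        "uniprot_cc_primary_fine": primary_fine,
    }
    if include_multiloc:
        # The "all" columns are only present when requested.
        row["uniprot_cc_all_coarse"] = all_coarse
        row["uniprot_cc_all_fine"] = all_fine
    return row


row = location_columns([("Nucleolus", "Nucleus"), ("Cytosol", "Cytoplasm")])
```

With include_multiloc=False the returned mapping contains only the two primary columns, mirroring the behaviour described for the parameter.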
Examples
>>> import grassp as gr
>>> adata = gr.datasets.hein_2024()
>>> gr.pp.annotate_uniprot_cc(adata)
>>> adata.obs[['uniprot_cc_primary_coarse', 'uniprot_cc_primary_fine']].head()
Filter for nuclear proteins:
>>> nuclear = adata[adata.obs['uniprot_cc_primary_coarse'] == 'Nucleus']
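The filter above matches only proteins whose first listed location is nuclear. To also catch proteins that list the nucleus as a secondary location, one option is to test the comma-separated column instead. The sketch below uses a small stand-in DataFrame in place of adata.obs; the protein labels, the has_location helper, and the exact separator are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for adata.obs after annotation (values are made up).
obs = pd.DataFrame(
    {
        "uniprot_cc_primary_coarse": ["Nucleus", "Cytoplasm", np.nan],
        "uniprot_cc_all_coarse": ["Nucleus,Cytoplasm", "Cytoplasm,Nucleus", np.nan],
    },
    index=["protA", "protB", "protC"],
)


def has_location(all_locations, term, sep=","):
    """Boolean mask: True where `term` is one of the separated entries.

    NaN entries (missing or failed queries) are treated as non-matches.
    """
    return all_locations.apply(
        lambda value: isinstance(value, str)
        and term in [part.strip() for part in value.split(sep)]
    )


primary_nuclear = obs["uniprot_cc_primary_coarse"] == "Nucleus"
any_nuclear = has_location(obs["uniprot_cc_all_coarse"], "Nucleus")
nuclear_rows = obs.loc[any_nuclear]
```

Here the primary-only mask selects protA alone, while the mask on the multi-location column also selects protB. Splitting on the separator rather than using a substring match avoids accidental hits on longer terms that merely contain the search word.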