grassp.pp.remove_cRAP_proteins
- remove_cRAP_proteins(data, id_column=None, id_type='uniprot', inplace=True, verbose=True)
Remove cRAP (common Repository of Adventitious Proteins) contaminants.
This function removes common laboratory contaminants from proteomics datasets using the cRAP database maintained at https://ftp.thegpm.org/fasta/crap/. Protein IDs are matched against the cRAP database, with support for both UniProt accession IDs (e.g., P00330) and entry names (e.g., ADH1_YEAST).
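The matching described above can be pictured as a set-membership test. The following is a minimal sketch under stated assumptions, not grassp's actual implementation: the `CRAP_ENTRIES` table and the `is_crap` helper are hypothetical, and the table holds only the one example pair mentioned in the text (P00330 / ADH1_YEAST).

```python
# Hypothetical miniature stand-in for the cRAP table: accession -> entry name.
# The real database ships with grassp and contains many more entries.
CRAP_ENTRIES = {"P00330": "ADH1_YEAST"}


def is_crap(protein_id: str, id_type: str = "uniprot") -> bool:
    """Return True if protein_id matches the toy cRAP table (illustrative helper)."""
    if id_type == "uniprot":
        # Match against UniProt accession IDs.
        return protein_id in CRAP_ENTRIES
    if id_type == "uniprot_entry_name":
        # Match against UniProt entry names.
        return protein_id in CRAP_ENTRIES.values()
    raise ValueError(f"unsupported id_type: {id_type!r}")


# "Q12345" is a made-up non-contaminant ID used only for illustration.
ids = ["P00330", "Q12345"]
kept = [pid for pid in ids if not is_crap(pid, id_type="uniprot")]
print(kept)
```

Note that an entry name passed with id_type='uniprot' would not match in this sketch, which is why the identifier type has to agree with the IDs in the data.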
- Parameters:
- data
AnnData The annotated data matrix with proteins as observations (rows).
- id_column
str | None (default: None) Column name in data.obs containing the protein IDs to match against the cRAP database. If None, data.obs_names (the row index) is used.
- id_type
Literal['uniprot', 'uniprot_entry_name'] (default: 'uniprot') Type of protein identifier to match:
  - 'uniprot': UniProt accession IDs (e.g., P00330)
  - 'uniprot_entry_name': UniProt entry names (e.g., ADH1_YEAST)
- inplace
bool (default: True) Whether to modify data in place or return a copy.
- verbose
bool (default: True) If True, print the list of removed protein IDs.
- Return type:
AnnData | None
- Returns:
If inplace=False, returns filtered data with cRAP proteins removed.
If inplace=True, modifies data in place and returns None.
Notes
Protein IDs with isoform suffixes (e.g., P00330-1) are automatically reduced to the base accession (P00330) before matching.
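As an illustration of that cleaning step, here is a minimal sketch. It assumes an isoform suffix is a trailing hyphen followed by digits; the `strip_isoform_suffix` helper is hypothetical and is not part of the grassp API, whose internal handling may differ.

```python
import re

# Assumption: an isoform suffix is a hyphen plus digits at the end of the ID.
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def strip_isoform_suffix(accession: str) -> str:
    """Return the base accession with any trailing isoform suffix removed."""
    return _ISOFORM_SUFFIX.sub("", accession)


print(strip_isoform_suffix("P00330-1"))  # P00330
print(strip_isoform_suffix("P00330"))    # P00330 (unchanged)
```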
If no cRAP proteins are found in the dataset, a warning is issued but the function completes successfully.
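The "warn but complete" behaviour can be sketched as follows. This is an illustrative pattern only: the `drop_contaminants` helper, the warning category, and the message text are assumptions and do not reproduce grassp's code.

```python
import warnings


def drop_contaminants(ids, contaminants):
    """Return ids without contaminants; warn (do not fail) if nothing matched."""
    kept = [pid for pid in ids if pid not in contaminants]
    if len(kept) == len(ids):
        # Hypothetical message; the wording used by grassp may differ.
        warnings.warn("No cRAP proteins found in the dataset.", UserWarning)
    return kept


# "Q12345" is a made-up ID that is not in the contaminant set.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = drop_contaminants(["Q12345"], {"P00330"})

print(result)       # the input is returned unchanged
print(len(caught))  # one warning was recorded
```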
The cRAP database is included with grassp. To update it, run: python -m grassp.datasets.marker_curation.update_cRAP
See also
remove_contaminants: Remove contaminants based on custom filter columns.
Examples
Remove cRAP proteins using UniProt IDs from row index:
>>> import grassp as gr
>>> adata = gr.datasets.hein_2024(enrichment="raw")
>>> adata.shape
(8538, 183)
>>> gr.pp.remove_cRAP_proteins(adata)
>>> adata.shape  # Some cRAP proteins removed
(8520, 183)