Dataset integration
import grassp as gr
import scanpy as sc
import anndata as ad
import numpy as np
Loading Data
For this tutorial we are going to integrate a differential ultracentrifugation (DC) dataset with an organellar IP (OrgIP) dataset. Both can be loaded conveniently from the grassp.ds module:
dc = gr.ds.hek_dc_2025(enrichment="enriched")
dc
AnnData object with n_obs × n_vars = 8599 × 7
obs: 'Majority protein IDs', 'Peptide counts (all)', 'Peptide counts (razor+unique)', 'Peptide counts (unique)', 'Protein names', 'Gene names', 'Fasta headers', 'Number of proteins', 'Peptides', 'Razor + unique peptides', 'Unique peptides', 'Sequence coverage [%]', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 2', 'Fraction 3', 'Q-value', 'Score', 'Intensity', 'iBAQ', 'MS/MS count', 'id', 'Peptide IDs', 'Peptide is razor', 'Mod. peptide IDs', 'Evidence IDs', 'MS/MS IDs', 'Best MS/MS', 'Oxidation (M) site IDs', 'Oxidation (M) site positions', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component'
var: 'subcellular_enrichment', 'n_merged_samples', 'mean', 'std'
uns: 'Search_Engine', 'hein2024_component_colors', 'hein2024_gt_component_colors', 'itzhak2016_component_colors', 'neighbors', 'pca', 'subcellular_enrichment_colors', 'umap'
obsm: 'X_pca', 'X_umap', 'proportion_intensities_imputed', 'unscaled_lfc_enriched_intensities'
varm: 'PCs'
layers: 'MS MS count', 'log1p_intensities', 'log_intensities', 'original_intensities', 'pvals', 'raw_intensities', 'raw_intensities_imputed'
obsp: 'connectivities', 'distances'
orgip = gr.ds.hein_2024(enrichment="enriched")
orgip
AnnData object with n_obs × n_vars = 8538 × 61
obs: 'Majority protein IDs', 'Protein IDs', 'Gene names', 'Gene_name_canonical', 'curated_ground_truth_v9.0', 'cluster_annotation', 'Graph-based_localization_annotation', 'consensus_graph_annnotation', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component', 'ComplexName', 'gene_name', 'uniprot_id', 'Protein names', 'Fasta headers', 'Number of proteins', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 101', 'Fraction 102', 'Fraction 103', 'Q-value', 'Score', 'Only identified by site', 'Reverse', 'Potential contaminant', 'id', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity'
var: 'subcellular_enrichment', 'covariate_Experiment', 'mean', 'std'
uns: 'hein2024_component_colors', 'neighbors', 'umap'
obsm: 'X_umap', 'X_umap_3D_orig', 'X_umap_orig'
layers: 'lfc_enrichment'
obsp: 'connectivities', 'distances'
Plotting individual maps
We can first look at how well each dataset resolves subcellular compartments on its own. DC is a more scalable technique than OrgIP, but provides lower resolution. Here we use UMAPs to get a quick idea of how well the compartments separate.
sc.pl.umap(orgip, color="hein2024_gt_component")
sc.pl.umap(dc, color="hein2024_gt_component")


Integration
To integrate the two datasets we need identifiers on which to match them. Candidates are UniProt IDs or gene names. In this case one dataset was searched against SwissProt and the other against TrEMBL, which makes it hard to merge on UniProt ID. We therefore first collapse all entries that share the same gene name:
dc = dc[dc.obs["Gene names"].notna()]
dc_agg = gr.pp.aggregate_proteins(dc, grouping_columns="Gene names", agg_func=np.mean)
dc_agg.obs.head()
| | Majority protein IDs | Peptide counts (all) | Peptide counts (razor+unique) | Peptide counts (unique) | Protein names | Gene names | Fasta headers | Number of proteins | Peptides | Razor + unique peptides | ... | mean_intensity | log1p_mean_intensity | pct_dropout_by_intensity | total_intensity | log1p_total_intensity | gene_symbol | hein2024_component | hein2024_gt_component | itzhak2016_component | n_merged_proteins |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAAS | Q9NRG9;Q9NRG9-2 | 24;20 | 24;20 | 24;20 | Aladin | AAAS | sp\|Q9NRG9\|AAAS_HUMAN Aladin OS=Homo sapiens OX... | 2 | 24 | 24 | ... | 2193667009.523809 | 21.50884 | 0.0 | 92134023168.0 | 25.24651 | AAAS | Endoplasmic reticulum | NaN | NaN | 1 |
AACS | Q86V21;Q86V21-2;Q86V21-3 | 28;21;15 | 28;21;15 | 28;21;15 | Acetoacetyl-CoA synthetase | AACS | sp\|Q86V21\|AACS_HUMAN Acetoacetyl-CoA synthetas... | 3 | 28 | 28 | ... | 314310622.095238 | 19.565892 | 2.380952 | 13201046528.0 | 23.303562 | AACS | Cytosol | NaN | NaN | 1 |
AADAT | Q8N5Z0;Q8N5Z0-2 | 9;8 | 9;8 | 9;8 | Kynurenine/alpha-aminoadipate aminotransferase... | AADAT | sp\|Q8N5Z0\|AADAT_HUMAN Kynurenine/alpha-aminoad... | 2 | 9 | 9 | ... | 52776511.52381 | 17.781577 | 50.0 | 2216613632.0 | 21.519247 | NaN | NaN | NaN | NaN | 1 |
AAED1 | Q7RTV5 | 4 | 4 | 4 | Thioredoxin-like protein AAED1 | AAED1 | sp\|Q7RTV5\|PXL2C_HUMAN Peroxiredoxin-like 2C OS... | 1 | 4 | 4 | ... | 15247564.285714 | 16.53993 | 47.619048 | 640397696.0 | 20.277599 | NaN | NaN | NaN | NaN | 1 |
AAGAB | Q6PD74;Q6PD74-2 | 10;6 | 10;6 | 10;6 | Alpha- and gamma-adaptin-binding protein p34 | AAGAB | sp\|Q6PD74\|AAGAB_HUMAN Alpha- and gamma-adaptin... | 2 | 10 | 10 | ... | 478817880.571429 | 19.986831 | 0.0 | 20110352384.0 | 23.724501 | AAGAB | Cytosol | Cytosol | NaN | 1 |
5 rows × 46 columns
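To make the collapsing step concrete, here is a minimal sketch of the same idea in plain pandas. It is not how gr.pp.aggregate_proteins is implemented; the identifiers, column names, and intensity values are made up for illustration. Entries without a gene name are dropped first, the remaining entries that share a gene name are averaged, and a count of merged entries is kept alongside.

```python
import numpy as np
import pandas as pd

# Toy intensity matrix: four protein entries (rows) across three fractions.
# All identifiers and values are hypothetical.
intensities = pd.DataFrame(
    [
        [1.0, 2.0, 3.0],
        [3.0, 4.0, 5.0],
        [10.0, 20.0, 30.0],
        [0.5, 0.5, 0.5],
    ],
    index=["P1", "P1-2", "P2", "P3"],
    columns=["frac_1", "frac_2", "frac_3"],
)
gene_names = pd.Series(["GENEA", "GENEA", "GENEB", np.nan], index=intensities.index)

# Drop entries without a gene name, mirroring the notna() filter above.
keep = gene_names.notna()

# Average all entries that share a gene name ...
collapsed = intensities[keep].groupby(gene_names[keep]).mean()

# ... and record how many entries went into each collapsed row.
n_merged = gene_names[keep].value_counts().reindex(collapsed.index)

print(collapsed)
print(n_merged)
```

In this toy case the two GENEA entries are averaged into one row, GENEB is kept as is, and the entry without a gene name disappears.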
orgip_agg = gr.pp.aggregate_proteins(
    orgip, grouping_columns="Gene_name_canonical", agg_func=np.mean
)
orgip_agg.obs_names = orgip_agg.obs_names.str.upper()
orgip_agg.obs.head()
| | Majority protein IDs | Protein IDs | Gene names | Gene_name_canonical | curated_ground_truth_v9.0 | cluster_annotation | Graph-based_localization_annotation | consensus_graph_annnotation | gene_symbol | hein2024_component | ... | Reverse | Potential contaminant | id | n_samples_by_intensity | mean_intensity | log1p_mean_intensity | pct_dropout_by_intensity | total_intensity | log1p_total_intensity | n_merged_proteins |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A0A2R8Y3M9[P] | A0A2R8Y3M9 | A0A2R8Y3M9 | NaN | A0A2R8Y3M9[p] | NaN | unclassified | unclassified | unclassified | NaN | NaN | ... | False | False | 2564 | 14 | 477612.021858 | 13.076556 | 92.349727 | 87403000.0 | 18.286039 | 1 |
A0A3B3ITR4[P] | A0A3B3ITR4 | A0A3B3ITR4;A0A0C4DG23;C9J718;C9JF32;C9JBN7;C9J... | NaN | A0A3B3ITR4[p] | NaN | recycling_endosome | recycling_endosome | recycling_endosome | NaN | NaN | ... | False | False | 2961 | 136 | 8201691.114754 | 15.919851 | 25.68306 | 1500909568.0 | 21.129337 | 1 |
A0A5C2GRJ2[P] | A0A5C2GRJ2 | A0A5C2GRJ2 | NaN | A0A5C2GRJ2[p] | NaN | unclassified | unclassified | unclassified | NaN | NaN | ... | False | False | 3198 | 143 | 4939933.333333 | 15.412863 | 21.857923 | 904007808.0 | 20.622349 | 1 |
A0A7D5Y1P9[P] | A0A7D5Y1P9 | A0A7D5Y1P9 | NaN | A0A7D5Y1P9[p] | NaN | unclassified | unclassified | unclassified | NaN | NaN | ... | False | False | 3409 | 11 | 367062.295082 | 12.81329 | 93.989071 | 67172400.0 | 18.022774 | 1 |
A0A024RBS8[P] | A0A024RBS8 | A0A024RBS8 | hCG_1744452 | A0A024RBS8[p] | NaN | nucleus | nucleus | nucleus | NaN | NaN | ... | False | False | 1262 | 9 | 191690.163934 | 12.163641 | 95.081967 | 35079300.0 | 17.373121 | 1 |
5 rows × 41 columns
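Upper-casing the OrgIP names presumably serves to make the gene-name match case-insensitive. The sketch below illustrates the effect with two small, hypothetical pandas indexes; it only assumes standard pandas Index methods (str.upper and intersection) and does not use the real datasets.

```python
import pandas as pd

# Hypothetical identifiers; the second index mixes upper and lower case.
dc_names = pd.Index(["AAAS", "AACS", "C1ORF43", "TOMM20"])
orgip_names = pd.Index(["Aaas", "C1orf43", "TOMM20", "LAMP1"])

# Exact string matching misses entries that differ only in case.
shared_raw = dc_names.intersection(orgip_names)

# Upper-casing both sides makes the comparison case-insensitive.
shared_upper = dc_names.str.upper().intersection(orgip_names.str.upper())

print(len(shared_raw), len(shared_upper))
```

Running the same comparison on dc_agg.obs_names and orgip_agg.obs_names before concatenating is a quick way to see how many proteins will survive an inner join.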
combined = ad.concat(
    [dc_agg, orgip_agg],
    axis=1,
    join="inner",
    merge="first",
    keys=["dc", "orgip"],
    label="dataset",
)
combined
AnnData object with n_obs × n_vars = 7014 × 68
obs: 'Majority protein IDs', 'Peptide counts (all)', 'Peptide counts (razor+unique)', 'Peptide counts (unique)', 'Protein names', 'Gene names', 'Fasta headers', 'Number of proteins', 'Peptides', 'Razor + unique peptides', 'Unique peptides', 'Sequence coverage [%]', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 2', 'Fraction 3', 'Q-value', 'Score', 'Intensity', 'iBAQ', 'MS/MS count', 'id', 'Peptide IDs', 'Peptide is razor', 'Mod. peptide IDs', 'Evidence IDs', 'MS/MS IDs', 'Best MS/MS', 'Oxidation (M) site IDs', 'Oxidation (M) site positions', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component', 'n_merged_proteins', 'Protein IDs', 'Gene_name_canonical', 'curated_ground_truth_v9.0', 'cluster_annotation', 'Graph-based_localization_annotation', 'consensus_graph_annnotation', 'ComplexName', 'gene_name', 'uniprot_id', 'Fraction 101', 'Fraction 102', 'Fraction 103', 'Only identified by site', 'Reverse', 'Potential contaminant'
var: 'subcellular_enrichment', 'mean', 'std', 'dataset'
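The shape of the combined object follows from the join: with axis=1 and join="inner", only proteins present in both datasets are kept as observations, while the variables of both datasets are placed side by side (7 + 61 = 68 here). A rough pandas analogue, using made-up tables and column names rather than the AnnData objects themselves, behaves the same way:

```python
import pandas as pd

# Two toy tables that share some row labels (proteins) but have
# different columns (fractions or pull-downs). All values are hypothetical.
dc_toy = pd.DataFrame(
    {"dc_frac_1": [0.1, 0.2, 0.3], "dc_frac_2": [1.0, 2.0, 3.0]},
    index=["AAAS", "AACS", "TOMM20"],
)
orgip_toy = pd.DataFrame(
    {"ip_1": [5.0, 6.0], "ip_2": [7.0, 8.0], "ip_3": [9.0, 10.0]},
    index=["TOMM20", "AAAS"],
)

# Column-wise concatenation keeping only rows present in both tables.
combined_toy = pd.concat([dc_toy, orgip_toy], axis=1, join="inner")

print(combined_toy.shape)
print(sorted(combined_toy.index))
```

Rows are aligned by label rather than by position, so the differing row order of the two inputs does not matter; AACS is dropped because it has no counterpart in the second table.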
sc.pp.neighbors(combined, use_rep="X")
sc.tl.umap(combined, min_dist=0.1)
sc.pl.umap(combined, color="hein2024_gt_component")
