Dataset integration


import grassp as gr
import scanpy as sc
import anndata as ad
import numpy as np

Loading Data

For this tutorial we integrate a differential ultracentrifugation (DC) dataset with an organellar IP (OrgIP) dataset. Both can be loaded with the grassp.ds module:

dc = gr.ds.hek_dc_2025(enrichment="enriched")
dc

AnnData object with n_obs × n_vars = 8599 × 7
    obs: 'Majority protein IDs', 'Peptide counts (all)', 'Peptide counts (razor+unique)', 'Peptide counts (unique)', 'Protein names', 'Gene names', 'Fasta headers', 'Number of proteins', 'Peptides', 'Razor + unique peptides', 'Unique peptides', 'Sequence coverage [%]', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 2', 'Fraction 3', 'Q-value', 'Score', 'Intensity', 'iBAQ', 'MS/MS count', 'id', 'Peptide IDs', 'Peptide is razor', 'Mod. peptide IDs', 'Evidence IDs', 'MS/MS IDs', 'Best MS/MS', 'Oxidation (M) site IDs', 'Oxidation (M) site positions', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component'
    var: 'subcellular_enrichment', 'n_merged_samples', 'mean', 'std'
    uns: 'Search_Engine', 'hein2024_component_colors', 'hein2024_gt_component_colors', 'itzhak2016_component_colors', 'neighbors', 'pca', 'subcellular_enrichment_colors', 'umap'
    obsm: 'X_pca', 'X_umap', 'proportion_intensities_imputed', 'unscaled_lfc_enriched_intensities'
    varm: 'PCs'
    layers: 'MS MS count', 'log1p_intensities', 'log_intensities', 'original_intensities', 'pvals', 'raw_intensities', 'raw_intensities_imputed'
    obsp: 'connectivities', 'distances'

orgip = gr.ds.hein_2024(enrichment="enriched")
orgip

AnnData object with n_obs × n_vars = 8538 × 61
    obs: 'Majority protein IDs', 'Protein IDs', 'Gene names', 'Gene_name_canonical', 'curated_ground_truth_v9.0', 'cluster_annotation', 'Graph-based_localization_annotation', 'consensus_graph_annnotation', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component', 'ComplexName', 'gene_name', 'uniprot_id', 'Protein names', 'Fasta headers', 'Number of proteins', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 101', 'Fraction 102', 'Fraction 103', 'Q-value', 'Score', 'Only identified by site', 'Reverse', 'Potential contaminant', 'id', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity'
    var: 'subcellular_enrichment', 'covariate_Experiment', 'mean', 'std'
    uns: 'hein2024_component_colors', 'neighbors', 'umap'
    obsm: 'X_umap', 'X_umap_3D_orig', 'X_umap_orig'
    layers: 'lfc_enrichment'
    obsp: 'connectivities', 'distances'

Plotting individual maps

We first look at how well each dataset resolves subcellular compartments on its own. DC is a more scalable technique than OrgIP, but it provides lower resolution. Both objects already contain a UMAP embedding (X_umap in obsm), so we can plot the UMAPs directly to get a quick impression of how well the compartments separate:

sc.pl.umap(orgip, color="hein2024_gt_component")
sc.pl.umap(dc, color="hein2024_gt_component")

../../_images/462932443e4b349f6f34baa3991ac780fa917c424fba23f491e4d8b2b1d33ba1.png

../../_images/f63b1088d5caf201277b05866703c1623b7945d869171448424eaf0850d06abf.png

Integration

To integrate the two datasets, we need identifiers on which to match them. Candidates are UniProt IDs or gene names. In this case one dataset was searched against SwissProt and the other against TrEMBL, which makes merging on UniProt ID difficult. We therefore first collapse all entries that share the same gene name:

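# Keep only entries with a gene name; copy so we work on real data, not a view
dc = dc[dc.obs["Gene names"].notna()].copy()
# Average all entries that share the same gene name
dc_agg = gr.pp.aggregate_proteins(dc, grouping_columns="Gene names", agg_func=np.mean)
dc_agg.obs.head()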

	Majority protein IDs	Peptide counts (all)	Peptide counts (razor+unique)	Peptide counts (unique)	Protein names	Gene names	Fasta headers	Number of proteins	Peptides	Razor + unique peptides	...	mean_intensity	log1p_mean_intensity	pct_dropout_by_intensity	total_intensity	log1p_total_intensity	gene_symbol	hein2024_component	hein2024_gt_component	itzhak2016_component	n_merged_proteins
AAAS	Q9NRG9;Q9NRG9-2	24;20	24;20	24;20	Aladin	AAAS	sp\|Q9NRG9\|AAAS_HUMAN Aladin OS=Homo sapiens OX...	2	24	24	...	2193667009.523809	21.50884	0.0	92134023168.0	25.24651	AAAS	Endoplasmic reticulum	NaN	NaN	1
AACS	Q86V21;Q86V21-2;Q86V21-3	28;21;15	28;21;15	28;21;15	Acetoacetyl-CoA synthetase	AACS	sp\|Q86V21\|AACS_HUMAN Acetoacetyl-CoA synthetas...	3	28	28	...	314310622.095238	19.565892	2.380952	13201046528.0	23.303562	AACS	Cytosol	NaN	NaN	1
AADAT	Q8N5Z0;Q8N5Z0-2	9;8	9;8	9;8	Kynurenine/alpha-aminoadipate aminotransferase...	AADAT	sp\|Q8N5Z0\|AADAT_HUMAN Kynurenine/alpha-aminoad...	2	9	9	...	52776511.52381	17.781577	50.0	2216613632.0	21.519247	NaN	NaN	NaN	NaN	1
AAED1	Q7RTV5	4	4	4	Thioredoxin-like protein AAED1	AAED1	sp\|Q7RTV5\|PXL2C_HUMAN Peroxiredoxin-like 2C OS...	1	4	4	...	15247564.285714	16.53993	47.619048	640397696.0	20.277599	NaN	NaN	NaN	NaN	1
AAGAB	Q6PD74;Q6PD74-2	10;6	10;6	10;6	Alpha- and gamma-adaptin-binding protein p34	AAGAB	sp\|Q6PD74\|AAGAB_HUMAN Alpha- and gamma-adaptin...	2	10	10	...	478817880.571429	19.986831	0.0	20110352384.0	23.724501	AAGAB	Cytosol	Cytosol	NaN	1

5 rows × 46 columns
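
Conceptually, collapsing by gene name is a group-by followed by an averaging step. The following is a minimal pandas sketch of that idea on made-up data. It is not how gr.pp.aggregate_proteins is implemented, and the accessions, gene names, and fraction labels are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy intensity table: rows are protein entries, columns are fractions.
# All names and values are made up for illustration.
intensities = pd.DataFrame(
    {"frac_1": [1.0, 3.0, 5.0, 2.0], "frac_2": [2.0, 4.0, 6.0, 8.0]},
    index=["P1", "P1-2", "P2", "P3"],
)
gene_names = pd.Series(["AAAS", "AAAS", "AACS", np.nan], index=intensities.index)

# Drop entries without a gene name, mirroring the notna() filter.
has_gene = gene_names.notna()
filtered = intensities[has_gene]

# Average all entries that share a gene name.
collapsed = filtered.groupby(gene_names[has_gene]).mean()

# Count how many entries were merged into each gene-level row.
n_merged = gene_names[has_gene].value_counts().reindex(collapsed.index)

print(collapsed)
print(n_merged)
```

In this toy table the two AAAS entries are averaged into one row, which is the same kind of bookkeeping that the n_merged_proteins column reports in the real output.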

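# Average all OrgIP entries that share the same canonical gene name
orgip_agg = gr.pp.aggregate_proteins(
    orgip, grouping_columns="Gene_name_canonical", agg_func=np.mean
)
# Upper-case the index so that identifiers match the DC gene names
orgip_agg.obs_names = orgip_agg.obs_names.str.upper()
orgip_agg.obs.head()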

	Majority protein IDs	Protein IDs	Gene names	Gene_name_canonical	curated_ground_truth_v9.0	cluster_annotation	Graph-based_localization_annotation	consensus_graph_annnotation	gene_symbol	hein2024_component	...	Reverse	Potential contaminant	id	n_samples_by_intensity	mean_intensity	log1p_mean_intensity	pct_dropout_by_intensity	total_intensity	log1p_total_intensity	n_merged_proteins
A0A2R8Y3M9[P]	A0A2R8Y3M9	A0A2R8Y3M9	NaN	A0A2R8Y3M9[p]	NaN	unclassified	unclassified	unclassified	NaN	NaN	...	False	False	2564	14	477612.021858	13.076556	92.349727	87403000.0	18.286039	1
A0A3B3ITR4[P]	A0A3B3ITR4	A0A3B3ITR4;A0A0C4DG23;C9J718;C9JF32;C9JBN7;C9J...	NaN	A0A3B3ITR4[p]	NaN	recycling_endosome	recycling_endosome	recycling_endosome	NaN	NaN	...	False	False	2961	136	8201691.114754	15.919851	25.68306	1500909568.0	21.129337	1
A0A5C2GRJ2[P]	A0A5C2GRJ2	A0A5C2GRJ2	NaN	A0A5C2GRJ2[p]	NaN	unclassified	unclassified	unclassified	NaN	NaN	...	False	False	3198	143	4939933.333333	15.412863	21.857923	904007808.0	20.622349	1
A0A7D5Y1P9[P]	A0A7D5Y1P9	A0A7D5Y1P9	NaN	A0A7D5Y1P9[p]	NaN	unclassified	unclassified	unclassified	NaN	NaN	...	False	False	3409	11	367062.295082	12.81329	93.989071	67172400.0	18.022774	1
A0A024RBS8[P]	A0A024RBS8	A0A024RBS8	hCG_1744452	A0A024RBS8[p]	NaN	nucleus	nucleus	nucleus	NaN	NaN	...	False	False	1262	9	191690.163934	12.163641	95.081967	35079300.0	17.373121	1

5 rows × 41 columns
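
Before concatenating, it can help to check how many identifiers the two datasets actually share after case normalization. The sketch below does this with plain pandas Index operations. The identifier lists are hypothetical stand-ins for dc_agg.obs_names and orgip_agg.obs_names, not values taken from the real datasets.

```python
import pandas as pd

# Hypothetical identifiers standing in for the two aggregated obs_names.
dc_ids = pd.Index(["AAAS", "AACS", "AAGAB", "AADAT"])
orgip_ids = pd.Index(["Aaas", "AAGAB", "hCG_1744452", "A0A2R8Y3M9[p]"])

# Normalize case on both sides before matching.
dc_norm = dc_ids.str.upper()
orgip_norm = orgip_ids.str.upper()

# Upper-casing can create duplicate identifiers, so check uniqueness first.
assert dc_norm.is_unique and orgip_norm.is_unique

shared = dc_norm.intersection(orgip_norm)
only_dc = dc_norm.difference(orgip_norm)
only_orgip = orgip_norm.difference(dc_norm)

print(f"shared: {len(shared)}, DC only: {len(only_dc)}, OrgIP only: {len(only_orgip)}")
```

Only the shared identifiers survive an inner join, so a small overlap at this stage would point to a mismatch in naming conventions rather than in biology.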

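# Concatenate along the feature axis, keeping only proteins found in both datasets
combined = ad.concat(
    [dc_agg, orgip_agg],
    axis=1,
    join="inner",  # keep only obs_names shared by both datasets
    merge="first",  # for each obs column, take the first value found
    keys=["dc", "orgip"],
    label="dataset",  # store the dataset of origin in combined.var
)
combined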

AnnData object with n_obs × n_vars = 7014 × 68
    obs: 'Majority protein IDs', 'Peptide counts (all)', 'Peptide counts (razor+unique)', 'Peptide counts (unique)', 'Protein names', 'Gene names', 'Fasta headers', 'Number of proteins', 'Peptides', 'Razor + unique peptides', 'Unique peptides', 'Sequence coverage [%]', 'Unique + razor sequence coverage [%]', 'Unique sequence coverage [%]', 'Mol. weight [kDa]', 'Sequence length', 'Sequence lengths', 'Fraction average', 'Fraction 1', 'Fraction 2', 'Fraction 3', 'Q-value', 'Score', 'Intensity', 'iBAQ', 'MS/MS count', 'id', 'Peptide IDs', 'Peptide is razor', 'Mod. peptide IDs', 'Evidence IDs', 'MS/MS IDs', 'Best MS/MS', 'Oxidation (M) site IDs', 'Oxidation (M) site positions', 'n_samples_by_intensity', 'mean_intensity', 'log1p_mean_intensity', 'pct_dropout_by_intensity', 'total_intensity', 'log1p_total_intensity', 'gene_symbol', 'hein2024_component', 'hein2024_gt_component', 'itzhak2016_component', 'n_merged_proteins', 'Protein IDs', 'Gene_name_canonical', 'curated_ground_truth_v9.0', 'cluster_annotation', 'Graph-based_localization_annotation', 'consensus_graph_annnotation', 'ComplexName', 'gene_name', 'uniprot_id', 'Fraction 101', 'Fraction 102', 'Fraction 103', 'Only identified by site', 'Reverse', 'Potential contaminant'
    var: 'subcellular_enrichment', 'mean', 'std', 'dataset'

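# Build the neighbor graph directly on the concatenated profiles in X
sc.pp.neighbors(combined, use_rep="X")
# Embed the combined map in two dimensions
sc.tl.umap(combined, min_dist=0.1)

# Color the integrated map by ground-truth compartment
sc.pl.umap(combined, color="hein2024_gt_component")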

../../_images/71e96998c4edbe33b0fc74c2c36effc6d25631c8f27e5e74519d897ef246c419.png