# Dataset integration

In [None]:
import grassp as gr
import scanpy as sc
import anndata as ad
import numpy as np

## Loading Data

In this tutorial we integrate a differential ultracentrifugation (DC) dataset with an organellar immunoprecipitation (OrgIP) dataset. Both can be loaded with the `grassp.ds` module:

In [None]:
dc = gr.ds.hek_dc_2025(enrichment="enriched")
dc

In [None]:
orgip = gr.ds.hein_2024(enrichment="enriched")
orgip

## Plotting individual maps

First we look at how well each dataset resolves subcellular compartments on its own. DC is a more scalable technique than OrgIP, but provides lower resolution. Here we use UMAPs to get a quick idea of how well the compartments separate.

In [None]:
sc.pl.umap(orgip, color="hein2024_gt_component")
sc.pl.umap(dc, color="hein2024_gt_component")

## Integration

To integrate the two datasets we need identifiers on which the datasets can be matched. Candidates are UniProt IDs or gene names. In this case one dataset was searched against SwissProt and the other against TrEMBL, which makes it hard to merge on UniProt ID. We therefore first collapse all entries that share the same gene name.

In [None]:
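# Drop entries without a gene name; copy so we work on a real object, not a view
dc = dc[dc.obs["Gene names"].notna()].copy()
# Collapse entries sharing a gene name by averaging their profiles
dc_agg = gr.pp.aggregate_proteins(dc, grouping_columns="Gene names", agg_func=np.mean)
dc_agg.obs.head()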
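
As a rough illustration of what collapsing entries by gene name means, the sketch below does the same kind of group-wise averaging on a small pandas table. It is not how `gr.pp.aggregate_proteins` is implemented; the identifiers, fraction names, and values are made up for the example.

```python
import pandas as pd

# Toy intensity table: rows are protein entries, columns are fractions.
# All identifiers and values are hypothetical.
profiles = pd.DataFrame(
    {"frac_1": [1.0, 3.0, 5.0, 2.0], "frac_2": [2.0, 4.0, 6.0, 8.0]},
    index=["P1", "P2", "P3", "P4"],
)
gene_names = pd.Series(["ACTB", "ACTB", "GAPDH", None], index=profiles.index)

# Drop entries without a gene name, mirroring the notna() filter above.
keep = gene_names.notna()
profiles, gene_names = profiles[keep], gene_names[keep]

# Average all entries that share a gene name; one row per gene remains.
aggregated = profiles.groupby(gene_names).mean()
print(aggregated)
```

The two `ACTB` entries are merged into a single row holding their mean profile, while `GAPDH` is left unchanged because it has only one entry.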

In [None]:
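# Collapse OrgIP entries sharing a canonical gene name by averaging their profiles
orgip_agg = gr.pp.aggregate_proteins(
    orgip, grouping_columns="Gene_name_canonical", agg_func=np.mean
)
# Upper-case the gene names so they can be matched across datasets
orgip_agg.obs_names = orgip_agg.obs_names.str.upper()
orgip_agg.obs.head()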
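
Upper-casing the names matters because matching on gene names is an exact string comparison. The sketch below shows the effect on two made-up indices; the gene lists are hypothetical and only assume that the two sources use different casing conventions.

```python
import pandas as pd

# Hypothetical gene-name indices from two sources with different casing.
names_a = pd.Index(["ACTB", "GAPDH", "TOMM20"])
names_b = pd.Index(["Actb", "Tomm20", "Lamp1"])

# Without normalization the exact string comparison finds no shared names.
raw_overlap = names_a.intersection(names_b)

# Upper-casing one side puts both indices in the same convention.
normalized_overlap = names_a.intersection(names_b.str.upper())

print(len(raw_overlap), len(normalized_overlap))
```

On real data it is worth comparing the size of the overlap with the size of each dataset before merging, since a small overlap usually points to a naming mismatch.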

In [None]:
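# Concatenate along the variable axis (axis=1): proteins are matched by name
# and the measurements of both datasets are placed side by side.
combined = ad.concat(
    [dc_agg, orgip_agg],
    axis=1,
    join="inner",  # keep only proteins present in both datasets
    merge="first",  # for shared annotation columns, take the first value found
    keys=["dc", "orgip"],
    label="dataset",  # records the source dataset of each variable
)
combined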
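
The effect of `axis=1` with `join="inner"` can be pictured with a pandas analogue: rows (proteins) are matched by name, columns (fractions or pulldowns) from both tables are placed side by side, and proteins missing from either table are dropped. This is a sketch on toy tables with hypothetical column names, not a reproduction of what `ad.concat` does to annotations.

```python
import pandas as pd

# Toy per-gene profiles from two experiments; names and values are made up.
dc_toy = pd.DataFrame(
    {"dc_frac_1": [0.1, 0.2, 0.3], "dc_frac_2": [0.9, 0.8, 0.7]},
    index=["ACTB", "GAPDH", "TOMM20"],
)
ip_toy = pd.DataFrame(
    {"ip_bait_1": [1.0, 2.0, 3.0]},
    index=["GAPDH", "TOMM20", "LAMP1"],
)

# Inner join on the row index: only genes found in both tables survive,
# and each surviving gene carries the columns of both experiments.
merged = pd.concat([dc_toy, ip_toy], axis=1, join="inner")
print(merged)
```

`ACTB` and `LAMP1` are dropped because each appears in only one table, so the merged profile of every remaining gene is complete.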

In [None]:
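# Build the neighbor graph directly on the concatenated profiles in X,
# then embed the combined map with UMAP
sc.pp.neighbors(combined, use_rep="X")
sc.tl.umap(combined, min_dist=0.1)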


In [None]:
sc.pl.umap(combined, color="hein2024_gt_component")