grassp.pp.impute_gaussian

impute_gaussian(data, width=0.3, distance=1.8, per_sample=True, random_state=0, inplace=True)

Impute missing values using a Gaussian distribution.

This function imputes missing values (zeros) in the data matrix using a Gaussian distribution. The parameters of the Gaussian are derived from the observed (non-zero) values, with the mean shifted downward by a specified number of standard deviations.

Parameters:
data AnnData

Annotated data matrix with proteins as observations (rows).

width float (default: 0.3)

Width of the Gaussian distribution as a fraction of the standard deviation of observed values.

distance float (default: 1.8)

Downward shift of the Gaussian mean relative to the mean of the observed values, in standard deviations.

per_sample bool (default: True)

If True, computes parameters for each sample separately.

random_state Union[int, RandomState, None] (default: 0)

Seed for the random number generator.

inplace bool (default: True)

If True, modifies data in place and returns None.

Return type:

ndarray | None

Returns:

If inplace=False, returns the imputed data matrix as a numpy.ndarray. If inplace=True, modifies the input data and returns None.

Notes

This implements a simple but effective imputation strategy commonly used in proteomics data analysis. Missing values are assumed to fall below the detection limit, so they are drawn from a Gaussian distribution whose parameters are derived from the observed values but with the mean shifted downward.
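
The strategy described above can be sketched in plain NumPy. This is an illustrative sketch only, not grassp's actual implementation: the function name impute_gaussian_sketch is hypothetical, and it assumes that samples are the columns of the matrix, that zeros mark missing values, and that the standard deviation is the population standard deviation of the observed values. The per-row count it returns is loosely analogous to the n_imputed column used in the example below.

```python
import numpy as np


def impute_gaussian_sketch(X, width=0.3, distance=1.8, per_sample=True, seed=0):
    """Hypothetical stand-alone sketch of down-shifted Gaussian imputation."""
    X = np.asarray(X, dtype=float).copy()
    rng = np.random.default_rng(seed)
    missing = X == 0  # assumption: zeros mark missing values

    # Assumption: samples are columns (proteins are rows).
    n_cols = X.shape[1]
    groups = [[j] for j in range(n_cols)] if per_sample else [list(range(n_cols))]

    for cols in groups:
        block = X[:, cols]              # fancy indexing returns a copy
        block_missing = missing[:, cols]
        observed = block[~block_missing]
        if observed.size == 0 or not block_missing.any():
            continue  # nothing to estimate from, or nothing to impute
        sd = observed.std()
        mu = observed.mean() - distance * sd  # mean shifted downward
        sigma = width * sd                    # narrower than the observed spread
        block[block_missing] = rng.normal(mu, sigma, size=int(block_missing.sum()))
        X[:, cols] = block                    # write the imputed copy back

    # Number of imputed values per row (per protein)
    return X, missing.sum(axis=1)


# Small usage example on a toy matrix with two missing entries
toy = np.array([[10.0, 0.0], [12.0, 11.0], [0.0, 13.0], [14.0, 15.0]])
imputed, n_imputed = impute_gaussian_sketch(toy)
print(imputed)
print(n_imputed)
```

With per_sample=True, each column gets its own mean and standard deviation, so a sample with lower overall intensities receives correspondingly lower imputed values. With per_sample=False, the sketch pools all observed values into one distribution.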

Examples

>>> import grassp as gr
>>> import numpy as np
>>> import scanpy as sc
>>> adata = gr.datasets.hein_2024(enrichment="raw")
>>> sc.pp.log1p(adata)
>>> int(np.sum(adata.X == 0))  # Count missing values (zeros)
747946
>>> gr.pp.impute_gaussian(adata, width=0.3, distance=1.8)
>>> int(np.sum(adata.X == 0))  # No more missing values
0
>>> int(adata.obs['n_imputed'].sum())  # Total imputed values
747946