Simulation Analysis

Overview

This page demonstrates scDiagnostics core functionality using controlled simulations. Here, we use synthetic single-cell data with known ground truth composition to:

Verify anomaly detection works correctly on pre-defined data properties
Test diagnostic capabilities for detecting out-of-reference populations in the query, i.e. situations where the reference is incomplete and lacks one or more cell types present in the query
Establish baseline metrics before applying to real biological data
Understand how diagnostic functions perform on controlled perturbations of the data

Simulation Data Generation

We use splatter to generate realistic synthetic scRNA-seq data. Splatter is a statistical simulation framework that generates single-cell RNA-seq count matrices with user-specified parameters including cell type structure, batch effects, differential expression, and gene-gene correlations. This allows for an evaluation in which the true cell type identity of every cell is known, an assumption that typically does not hold for real-world data.

Why synthetic data?

Known ground truth — Every cell has a definitive true cell type identity
Controlled perturbations — We can systematically introduce annotation errors (missing cell types) and measure detection sensitivity
Reproducibility — Results are deterministic and not confounded by biological heterogeneity
Baseline establishment — Determines expectations for anomaly detection before analyzing complex real data

Simulation Design

Data generation using splatter:

Three cell types: A, B, C (each ~333 cells in total across both batches)
Two batches (reference and query, 500 cells each)
10,000 genes with realistic variance structure
Strong marker genes for each cell type
No outliers or technical artifacts
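
A minimal sketch of how a dataset with this design could be generated with splatter. The differential expression settings (`de.prob`, `de.facLoc`), the seed, and the mapping of `Batch1`/`Batch2` to reference/query are assumptions made for illustration; they are not the settings used to produce the figures on this page.

```r
# Sketch only: de.prob, de.facLoc and seed are illustrative assumptions.
suppressPackageStartupMessages({
    library(splatter)
    library(scuttle)
})

sim <- splatSimulate(
    nGenes     = 10000,             # 10,000 genes
    batchCells = c(500, 500),       # two batches of 500 cells each
    group.prob = c(1/3, 1/3, 1/3),  # three equally likely cell types
    de.prob    = 0.1,               # assumed fraction of marker (DE) genes per type
    de.facLoc  = 1,                 # assumed (strong) DE effect size
    method     = "groups",
    seed       = 42,
    verbose    = FALSE
)

# Relabel splatter's Group1..Group3 as cell types A, B, C
group_to_type <- c(Group1 = "A", Group2 = "B", Group3 = "C")
sim$Cell_Type <- unname(group_to_type[as.character(sim$Group)])

# Add a log-normalized "logcounts" assay
sim <- logNormCounts(sim)

# Treat one batch as the reference and the other as the query (assumed mapping)
reference_data <- sim[, sim$Batch == "Batch1"]
query_data     <- sim[, sim$Batch == "Batch2"]
```

Because the group labels come straight from the simulation, `Cell_Type` is the ground truth against which any later annotation can be compared.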

Cell type annotation method:

Query cells are annotated using SingleR, a reference-based annotation tool that assigns each query cell the label whose reference expression profiles correlate best with it (Spearman correlation). This provides a practical test case for evaluating annotation diagnostics.

Scenario 1: Baseline (Identical Cell Types)

Reference contains: Cell Types A, B, C
Query contains: Cell Types A, B, C
All cell types present in reference
SingleR annotation should be perfect

Scenario 2: Missing Cell Type

Reference contains: Cell Types A, B (C removed)
Query contains: Cell Types A, B, C (unchanged)
Cell Type C cells get misclassified by SingleR as A or B
Diagnostic functions should flag these as anomalous
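
The forced misclassification in Scenario 2 can be illustrated with a toy example in base R. This is a simplified stand-in for a correlation-based annotator, not SingleR itself: each query cell is assigned to whichever reference profile it correlates with best, so a cell whose true type is absent from the reference still receives one of the available labels. All profiles and noise levels are hypothetical.

```r
set.seed(1)
n_genes <- 50

# Hypothetical mean expression profiles for three cell types (genes x types)
profiles <- cbind(A = rnorm(n_genes), B = rnorm(n_genes), C = rnorm(n_genes))

# Query: 20 cells per type, each a noisy copy of its type profile (genes x cells)
true_type <- rep(c("A", "B", "C"), each = 20)
query <- profiles[, true_type] +
    matrix(rnorm(n_genes * length(true_type), sd = 0.5), nrow = n_genes)

# Incomplete reference: only the A and B profiles are available
ref_profiles <- profiles[, c("A", "B")]

# Assign each query cell to the best-correlated reference profile
cors <- cor(query, ref_profiles, method = "spearman")  # cells x reference types
predicted <- colnames(ref_profiles)[max.col(cors, ties.method = "first")]

# Cross-tabulate ground truth against the forced predictions
print(table(true = true_type, predicted = predicted))
```

In the resulting table every true C cell lands in the A or B column, which is exactly the kind of hidden error the diagnostics are meant to expose.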

Diagnostic Functions Used

detectAnomaly()

Identifies cells with transcriptomic profiles that deviate from the reference using isolation forests:

anomaly_output <- detectAnomaly(
    reference_data = reference_data,
    query_data = query_data,
    ref_cell_type_col = "Cell_Type",
    query_cell_type_col = "SingleR_annotation",
    pc_subset = 1:2,
    n_tree = 1000,
    anomaly_threshold = 0.6
)

Returns: Anomaly scores and classification for each cell

plotCellTypePCA()

Projects query cells into reference PCA space to visualize distribution and separation:

plotCellTypePCA_custom(
    query_data = query_data,
    reference_data = reference_data,
    query_cell_type_col = "SingleR_annotation",
    ref_cell_type_col = "Cell_Type",
    pc_subset = 1:2,
    assay_name = "logcounts",
    diagonal_facet = "ridge",
    upper_facet = "blank",
    cell_type_colors = cell_type_colors
)

Returns: Pairwise PCA scatterplots with density distributions
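
As a rough illustration of what projecting query cells into reference PCA space means, the sketch below fits a PCA on a toy reference matrix and applies the same centering and loadings to a toy query matrix using base R. The matrices and gene names are hypothetical, and this is not necessarily how the package implements the projection internally.

```r
set.seed(1)
genes <- paste0("gene", 1:20)

# Toy expression matrices with cells in rows and genes in columns
ref_mat   <- matrix(rnorm(100 * 20), nrow = 100, dimnames = list(NULL, genes))
query_mat <- matrix(rnorm(50 * 20),  nrow = 50,  dimnames = list(NULL, genes))

# Fit the PCA on the reference only
ref_pca <- prcomp(ref_mat, center = TRUE, scale. = FALSE)

# Project the query with the reference centering and loadings
query_pcs <- predict(ref_pca, newdata = query_mat)[, 1:2]
ref_pcs   <- ref_pca$x[, 1:2]

# The projection is equivalent to centering by the reference gene means
# and multiplying by the reference rotation matrix
manual <- sweep(query_mat, 2, ref_pca$center) %*% ref_pca$rotation[, 1:2]
```

Since the query never influences the axes, query cells from a population missing in the reference can only appear where the reference-derived components happen to place them.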

Analysis Results

Scenario 1: Baseline (All Cell Types Present)

PCA Projection (Figure A)

Query cells (solid points) cluster tightly within reference cell type regions
Clean separation between Cell Types A, B, C
Diagonal panels show well-separated density distributions
Perfect annotation expected

Anomaly Detection (Figures B & C)

Cell Type A: Anomaly scores cluster near 0 (low anomaly)
Cell Type B: Anomaly scores cluster near 0 (low anomaly)
Density plots show sharp peaks below threshold (0.6)
Very few cells flagged as anomalous (expected for well-matched data)

Interpretation: When the reference contains all query cell types, anomaly detection correctly reports that the query cells are not anomalous. This is the negative control.

Scenario 2: Missing Cell Type

PCA Projection (Figure D)

Query cells labeled as “Cell Type A” or “Cell Type B” actually include misannotated Cell Type C
Misannotated cells visually separate from true A/B populations in PCA space
Diagonal densities show bimodal or right-skewed distributions
Clear visual evidence of heterogeneity within annotated groups

Anomaly Detection: Cell Type A (Figures E & F)

Baseline (E): Anomaly scores peak near 0; clean distribution
Missing Type (F): Anomaly scores show bimodal distribution with tail extending above 0.6
Misannotated Cell Type C cells get flagged as anomalous
Sensitivity: Successfully detects ~70-80% of true anomalies (misclassified C cells)

Anomaly Detection: Cell Type B (Figures G & H)

Baseline (G): Anomaly scores peak near 0; clean distribution
Missing Type (H): Anomaly scores show right shift; many cells exceed threshold
Stronger anomaly signal than for Cell Type A, because more of the missing Cell Type C cells were assigned to B
Specificity: Minimal false positives among correctly annotated cells

Interpretation: When the reference lacks a cell type that is present in the query, anomaly detection successfully flags the misannotated cells. This demonstrates sensitivity to genuine transcriptomic discordance.
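
Because the simulation provides ground truth, sensitivity and specificity can be computed directly from the anomaly flags. The sketch below uses small hypothetical vectors for cells annotated as one type; they are illustrative only and are not results from this analysis.

```r
# Hypothetical ground truth and anomaly flags for cells annotated as "A"
true_type  <- c("A", "A", "A", "A", "C", "C", "C", "C", "C", "A")
is_flagged <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# A cell is a true anomaly if its real type (C) is missing from the reference
is_misannotated <- true_type == "C"

# Sensitivity: fraction of misannotated cells that were flagged
sensitivity <- sum(is_flagged & is_misannotated) / sum(is_misannotated)

# Specificity: fraction of correctly annotated cells that were not flagged
specificity <- sum(!is_flagged & !is_misannotated) / sum(!is_misannotated)

print(c(sensitivity = sensitivity, specificity = specificity))  # both 0.8 here
```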

Key Validations

✓ Baseline control works:

Identical cell types between reference and query produce minimal anomalies
Diagnostic functions correctly recognize matching populations

✓ Anomaly detection is sensitive:

Missing cell types detected with high sensitivity (~70-80% of misclassified C cells flagged among cells annotated as Cell Type A)
Misannotated cells clearly separated by anomaly scores

✓ False positives are low:

True reference cell types show minimal false anomaly flags
Specificity maintained even with missing populations

Design Insights

Why these controls matter:

Baseline validates correctness — If anomaly detection flags normal cells as anomalous, the function isn’t working
Missing type tests sensitivity — Real data often contains novel populations; we need to detect when reference is incomplete
Controlled perturbations — Synthetic data lets us know ground truth, avoiding ambiguity in real data interpretation

What we learn for real data:

Anomaly scores ~0 suggest cells match reference well
Anomaly scores well above the 0.6 threshold suggest genuine discordance
Bimodal distributions suggest heterogeneous populations within a single annotation
Density shifts between the two scenarios may resemble the shifts seen between disease and healthy samples in real data
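
On real data there is no ground truth, so a practical first summary is the fraction of cells above the threshold within each annotated label. A minimal sketch with hypothetical scores and labels (illustrative values only):

```r
# Hypothetical anomaly scores and annotation labels
scores <- c(0.05, 0.10, 0.72, 0.08, 0.68, 0.12, 0.07, 0.75)
label  <- c("A",  "A",  "A",  "A",  "B",  "B",  "B",  "B")
threshold <- 0.6

# Fraction of cells per annotated label whose score exceeds the threshold
frac_flagged <- tapply(scores > threshold, label, mean)
print(frac_flagged)
```

A label with an unusually high flagged fraction, or a visibly bimodal score distribution, is a candidate for containing cells from a population that the reference does not cover.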

Next Steps

Apply validated diagnostics to real data:

COVID-19 monocytes — Disease-driven expression changes
MERFISH colitis — Spatial transcriptomics validation
Explore tool diagnostics — Compare with annotation tool confidence metrics

Figures Generated

Baseline Scenario:

cell_type_pca_original.png — PCA projection (all types present)
cell_type_A_isoForest_original.png — Anomaly scores for Cell Type A
cell_type_B_isoForest_original.png — Anomaly scores for Cell Type B
cell_type_A_anomaly_density_original.png — Anomaly score distribution (A)
cell_type_B_anomaly_density_original.png — Anomaly score distribution (B)

Missing Cell Type Scenario:

cell_type_pca.png — PCA projection (Cell Type C absent from reference)
cell_type_A_isoForest.png — Anomaly scores for Cell Type A (with misclassified C)
cell_type_B_isoForest.png — Anomaly scores for Cell Type B (with misclassified C)
cell_type_A_anomaly_density_missing.png — Anomaly score distribution (A, with C misclassified)
cell_type_B_anomaly_density_missing.png — Anomaly score distribution (B, with C misclassified)