Simulation Analysis

Overview

This page demonstrates scDiagnostics core functionality using controlled simulations. Here, we use synthetic single-cell data with known ground truth composition to:

  • Verify that anomaly detection behaves as expected on data with pre-defined properties
  • Test diagnostic capabilities for detecting out-of-reference populations in the query, i.e. situations where the reference is incomplete and lacks one or more cell types present in the query
  • Establish baseline metrics before applying to real biological data
  • Understand how diagnostic functions perform on controlled perturbations of the data

Simulation Data Generation

We use splatter to generate realistic synthetic scRNA-seq data. Splatter is a statistical simulation framework that generates single-cell RNA-seq count matrices with user-specified parameters including cell type structure, batch effects, differential expression, and gene-gene correlations. This allows for an evaluation in which the true cell type identity of every cell is known, a condition that rarely holds for real-world data.

Why synthetic data?

  • Known ground truth — Every cell has a definitive true cell type identity
  • Controlled perturbations — We can systematically introduce annotation errors (missing cell types) and measure detection sensitivity
  • Reproducibility — Results are deterministic and not confounded by biological heterogeneity
  • Baseline establishment — Determines expectations for anomaly detection before analyzing complex real data

Simulation Design

Data generation using splatter:

  • Three cell types: A, B, C (~333 cells each, pooled over both batches)
  • Two batches of 500 cells each: one serves as the reference, the other as the query
  • 10,000 genes with realistic variance structure
  • Strong marker genes for each cell type
  • No outliers or technical artifacts
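
The design above could be set up roughly as follows. This is a minimal sketch rather than the exact script behind this page: the seed and the differential expression settings (de.prob, de.facLoc) are assumptions chosen only to give clearly separated groups, and the mapping of splatter's Group1/Group2/Group3 to A/B/C and of Batch1/Batch2 to reference/query is likewise assumed.

```r
# Sketch: simulate three cell types across two batches with splatter.
library(splatter)   # Bioconductor simulation framework
library(scuttle)    # provides logNormCounts()

params <- newSplatParams(
    nGenes     = 10000,
    batchCells = c(500, 500),        # batch 1 -> reference, batch 2 -> query
    group.prob = c(1/3, 1/3, 1/3),   # three equally likely cell types
    de.prob    = 0.1,                # assumed fraction of DE (marker) genes
    de.facLoc  = 0.5,                # assumed strength of the marker effect
    seed       = 42                  # assumed seed, for reproducibility
)
sim <- splatSimulate(params, method = "groups", verbose = FALSE)
sim <- logNormCounts(sim)            # adds the "logcounts" assay

# Rename splatter's Group1..3 to the cell type labels used on this page
sim$Cell_Type <- unname(
    c(Group1 = "A", Group2 = "B", Group3 = "C")[as.character(sim$Group)]
)

# Split by batch into reference and query objects
reference_data <- sim[, sim$Batch == "Batch1"]
query_data     <- sim[, sim$Batch == "Batch2"]

table(reference_data$Cell_Type)
table(query_data$Cell_Type)
```

Because group membership is drawn at random with equal probability, the per-type counts in each batch are only approximately equal.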

Cell type annotation method:

Query cells are annotated using SingleR, a reference-based annotation tool that correlates query cell expression against reference pseudobulk profiles. This provides a practical test case for evaluating annotation diagnostics.

Scenario 1: Baseline (Identical Cell Types)

  • Reference contains: Cell Types A, B, C
  • Query contains: Cell Types A, B, C
  • All cell types present in reference
  • SingleR annotation should be perfect

Scenario 2: Missing Cell Type

  • Reference contains: Cell Types A, B (C removed)
  • Query contains: Cell Types A, B, C (unchanged)
  • Cell Type C cells get misclassified by SingleR as A or B
  • Diagnostic functions should flag these as anomalous
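
The two scenarios differ only in which reference is passed to SingleR. The sketch below illustrates this on a small toy log-expression matrix instead of the splatter data, so that it runs on its own; the helper make_cells(), the marker gene layout, and all object names are hypothetical and exist only for this illustration.

```r
# Sketch: baseline vs. missing-cell-type annotation with SingleR on toy data.
library(SingleR)
set.seed(1)

n_genes <- 300
markers <- list(A = 1:30, B = 31:60, C = 61:90)  # hypothetical marker genes

# Toy log-expression matrix: background noise plus elevated markers per type
make_cells <- function(types) {
    mat <- matrix(rnorm(n_genes * length(types), mean = 1, sd = 0.5),
                  nrow = n_genes,
                  dimnames = list(paste0("gene", seq_len(n_genes)),
                                  paste0("cell", seq_along(types))))
    for (j in seq_along(types)) {
        idx <- markers[[types[j]]]
        mat[idx, j] <- mat[idx, j] + 3
    }
    mat
}

ref_types   <- rep(c("A", "B", "C"), each = 50)
query_types <- rep(c("A", "B", "C"), each = 50)
ref_mat     <- make_cells(ref_types)
query_mat   <- make_cells(query_types)

# Scenario 1: the reference contains all three cell types
pred_baseline <- SingleR(test = query_mat, ref = ref_mat, labels = ref_types)

# Scenario 2: Cell Type C is removed from the reference only
keep <- ref_types != "C"
pred_missing <- SingleR(test = query_mat, ref = ref_mat[, keep],
                        labels = ref_types[keep])

# Every true C cell in the query is forced into label A or B
table(truth = query_types, annotated = pred_missing$labels)
```

In the second call the labels can only take the values present in the reference, which is why the C cells necessarily end up annotated as A or B.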

Diagnostic Functions Used

detectAnomaly()

Identifies cells with transcriptomic profiles that deviate from the reference using isolation forests:

anomaly_output <- detectAnomaly(
    reference_data = reference_data,
    query_data = query_data,
    ref_cell_type_col = "Cell_Type",
    query_cell_type_col = "SingleR_annotation",
    pc_subset = 1:2,
    n_tree = 1000,
    anomaly_threshold = 0.6
)

Returns: Anomaly scores and classification for each cell

plotCellTypePCA()

Projects query cells into reference PCA space to visualize distribution and separation:

plotCellTypePCA_custom(
    query_data = query_data,
    reference_data = reference_data,
    query_cell_type_col = "SingleR_annotation",
    ref_cell_type_col = "Cell_Type",
    pc_subset = 1:2,
    assay_name = "logcounts",
    diagonal_facet = "ridge",
    upper_facet = "blank",
    cell_type_colors = cell_type_colors
)

Returns: Pairwise PCA scatterplots with density distributions

Analysis Results

Scenario 1: Baseline (All Cell Types Present)

PCA Projection (Figure A)

  • Query cells (solid points) cluster tightly within reference cell type regions
  • Clean separation between Cell Types A, B, C
  • Diagonal panels show well-separated density distributions
  • Perfect annotation expected

Anomaly Detection (Figures B & C)

  • Cell Type A: Anomaly scores cluster near 0 (low anomaly)
  • Cell Type B: Anomaly scores cluster near 0 (low anomaly)
  • Density plots show sharp peaks below threshold (0.6)
  • Very few cells flagged as anomalous (expected for well-matched data)

Interpretation: When the reference contains all query cell types, anomaly detection correctly reports the cells as normal. This scenario serves as the negative control: no anomalies are expected, so any flagged cells are false positives.

Scenario 2: Missing Cell Type

PCA Projection (Figure D)

  • Query cells labeled as “Cell Type A” or “Cell Type B” actually include misannotated Cell Type C
  • Misannotated cells visually separate from true A/B populations in PCA space
  • Diagonal densities show bimodal or right-skewed distributions
  • Clear visual evidence of heterogeneity within annotated groups

Anomaly Detection: Cell Type A (Figures E & F)

  • Baseline (E): Anomaly scores peak near 0; clean distribution
  • Missing Type (F): Anomaly scores show a bimodal distribution with a tail extending above 0.6
  • Misannotated Cell Type C cells are flagged as anomalous
  • Sensitivity: roughly 70-80% of the true anomalies (C cells misclassified as A) are detected

Anomaly Detection: Cell Type B (Figures G & H)

  • Baseline (G): Anomaly scores peak near 0; clean distribution
  • Missing Type (H): Anomaly scores shift to the right; many cells exceed the threshold
  • The anomaly signal is even stronger for Cell Type B, because more of the missing Cell Type C cells were assigned to B
  • Specificity: few false positives among correctly annotated cells

Interpretation: When the reference lacks a cell type present in the query, anomaly detection flags the misannotated cells. This demonstrates sensitivity to genuine transcriptomic discordance.
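
Because the true identity of every simulated cell is known, sensitivity and specificity can be computed directly from the anomaly flags. The following is a small sketch on made-up vectors; true_type and is_anomaly are hypothetical stand-ins for the simulated ground truth and the per-cell anomaly calls, not values from the analysis above.

```r
# Sketch: sensitivity and specificity of anomaly flags against ground truth.
# Hypothetical query cells, all annotated as "A" or "B" by the classifier.
true_type  <- c("A", "A", "A", "C", "C", "B", "B", "C", "C", "C")
is_anomaly <- c(FALSE, FALSE, TRUE, TRUE, TRUE,
                FALSE, FALSE, TRUE, FALSE, TRUE)

# Cell Type C is absent from the reference, so every C cell is misannotated
misannotated <- true_type == "C"

# Sensitivity: share of misannotated cells that were flagged
sensitivity <- sum(is_anomaly & misannotated) / sum(misannotated)

# Specificity: share of correctly annotated cells that were not flagged
specificity <- sum(!is_anomaly & !misannotated) / sum(!misannotated)

c(sensitivity = sensitivity, specificity = specificity)
# Both are 0.8 for these toy vectors
```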

Key Validations

✓ Baseline control works:

  • Identical cell types between reference and query produce minimal anomalies
  • Diagnostic functions correctly recognize matching populations

✓ Anomaly detection is sensitive:

  • Missing cell types detected with high sensitivity
  • Misannotated cells clearly separated by anomaly scores

✓ False positives are low:

  • True reference cell types show minimal false anomaly flags
  • Specificity maintained even with missing populations

Design Insights

Why these controls matter:

  1. Baseline validates correctness — If anomaly detection flags normal cells as anomalous, the function isn’t working
  2. Missing type tests sensitivity — Real data often contains novel populations; we need to detect when reference is incomplete
  3. Controlled perturbations — Synthetic data lets us know ground truth, avoiding ambiguity in real data interpretation

What we learn for real data:

  • Anomaly scores near 0 suggest that cells match the reference well
  • Anomaly scores well above the 0.6 threshold suggest genuine discordance
  • Bimodal distributions suggest heterogeneous populations within a single annotation
  • Density shifts like those between the two scenarios may also arise in real comparisons, for example disease versus healthy samples
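
For real data there is no ground truth, so a per-annotation summary of the scores is a practical first check. The helper below is a hedged sketch: summarize_scores() is a hypothetical function written for this page, and it assumes that a numeric vector of anomaly scores and a matching vector of annotations have already been extracted.

```r
# Sketch: summarize anomaly scores per annotated cell type.
summarize_scores <- function(scores, annotation, threshold = 0.6) {
    data.frame(
        annotation   = sort(unique(annotation)),
        median_score = as.numeric(tapply(scores, annotation, median)),
        # fraction of cells whose score exceeds the threshold
        frac_flagged = as.numeric(tapply(scores > threshold, annotation, mean))
    )
}

# Hypothetical scores: "A" holds a high-scoring subgroup, "B" does not
scores     <- c(0.10, 0.15, 0.20, 0.70, 0.80, 0.10, 0.20, 0.30, 0.25)
annotation <- c(rep("A", 5), rep("B", 4))

score_summary <- summarize_scores(scores, annotation)
score_summary
```

A low median combined with a non-trivial flagged fraction, as for "A" here, is the pattern expected when a small misannotated subgroup is hidden inside an otherwise well-matched annotation.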

Next Steps

Apply the validated diagnostics to real data.

Figures Generated

Baseline Scenario:

  • cell_type_pca_original.png — PCA projection (all types present)
  • cell_type_A_isoForest_original.png — Anomaly scores for Cell Type A
  • cell_type_B_isoForest_original.png — Anomaly scores for Cell Type B
  • cell_type_A_anomaly_density_original.png — Anomaly score distribution (A)
  • cell_type_B_anomaly_density_original.png — Anomaly score distribution (B)

Missing Cell Type Scenario:

  • cell_type_pca.png — PCA projection (Cell Type C absent from reference)
  • cell_type_A_isoForest.png — Anomaly scores for Cell Type A (with misclassified C)
  • cell_type_B_isoForest.png — Anomaly scores for Cell Type B (with misclassified C)
  • cell_type_A_anomaly_density_missing.png — Anomaly score distribution (A, with C misclassified)
  • cell_type_B_anomaly_density_missing.png — Anomaly score distribution (B, with C misclassified)