anomaly_output <- detectAnomaly(
  reference_data = reference_data,
  query_data = query_data,
  ref_cell_type_col = "Cell_Type",
  query_cell_type_col = "SingleR_annotation",
  pc_subset = 1:2,
  n_tree = 1000,
  anomaly_threshold = 0.5
)

Simulation Analysis
Overview
This page demonstrates scDiagnostics core functionality using controlled simulations. Here, we use synthetic single-cell data with known ground truth composition to:
- Verify that anomaly detection works correctly on data with pre-defined properties
- Test diagnostic capabilities for detecting out-of-reference populations in the query, i.e. situations where the reference is incomplete and lacks one or more cell types present in the query
- Establish baseline metrics before applying to real biological data
- Understand how diagnostic functions perform on controlled perturbations of the data
Simulation Data Generation
We use splatter to generate realistic synthetic scRNA-seq data. Splatter is a statistical simulation framework that generates single-cell RNA-seq count matrices with user-specified parameters including cell type structure, batch effects, differential expression, and gene-gene correlations. This allows an evaluation in which the true cell type identity of every cell is known, a condition that rarely holds for real-world data.
Why synthetic data?
- Known ground truth — Every cell has a definitive true cell type identity
- Controlled perturbations — We can systematically introduce annotation errors (missing cell types) and measure detection sensitivity
- Reproducibility — Results are deterministic and not confounded by biological heterogeneity
- Baseline establishment — Determines expectations for anomaly detection before analyzing complex real data
Simulation Design
Data generation using splatter:
- Three cell types: A, B, C (~333 cells each, counted across both batches)
- Two batches (reference and query, 500 cells each)
- 10,000 genes with realistic variance structure
- Strong marker genes for each cell type
- No outliers or technical artifacts
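A design like the one listed above could be set up roughly as follows. This is a sketch, not the code used to produce the figures on this page: the gene count, batch sizes and group proportions follow the list, while the seed, the `de.prob` value, the `Cell_Type` relabelling and the choice of Batch1 as reference and Batch2 as query are assumptions made for illustration.

```r
library(splatter)
library(scuttle)

# Design values from the list above; seed and de.prob are illustrative assumptions
params <- newSplatParams(
  nGenes     = 10000,
  batchCells = c(500, 500),       # two batches of 500 cells each
  group.prob = c(1/3, 1/3, 1/3),  # three equally likely cell types
  de.prob    = 0.2,               # assumed fraction of DE (marker) genes per type
  seed       = 42                 # assumed seed, for reproducibility
)

# Simulate counts with discrete groups, then add a "logcounts" assay
sim <- splatSimulate(params, method = "groups", verbose = FALSE)
sim <- logNormCounts(sim)

# splatter labels groups Group1..Group3; relabel them A, B, C (assumed mapping)
group_to_type <- c(Group1 = "A", Group2 = "B", Group3 = "C")
sim$Cell_Type <- unname(group_to_type[as.character(sim$Group)])

# Assumed split: first batch is the reference, second batch is the query
reference_data <- sim[, sim$Batch == "Batch1"]
query_data     <- sim[, sim$Batch == "Batch2"]

table(query_data$Cell_Type)
```

Because group membership is sampled, the per-type cell counts vary slightly around one third of each batch.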
Cell type annotation method:
Query cells are annotated using SingleR, a reference-based annotation tool that correlates query cell expression against reference pseudobulk profiles. This provides a practical test case for evaluating annotation diagnostics.
Scenario 1: Baseline (Identical Cell Types)
- Reference contains: Cell Types A, B, C
- Query contains: Cell Types A, B, C
- All cell types present in reference
- SingleR annotation should be perfect
Scenario 2: Missing Cell Type
- Reference contains: Cell Types A, B (C removed)
- Query contains: Cell Types A, B, C (unchanged)
- Cell Type C cells get misclassified by SingleR as A or B
- Diagnostic functions should flag these as anomalous
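The reason Cell Type C cells end up labelled A or B is that a reference-based classifier can only return labels that exist in the reference. The toy example below illustrates this with a deliberately simplified classifier that assigns each cell to the best Spearman-correlated mean profile. It is not SingleR's algorithm, and the data, gene counts and the `annotate()` helper are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 30 genes, each cell type has its own block of 10 markers
n_genes <- 30
types <- c("A", "B", "C")
profiles <- sapply(seq_along(types), function(i) {
  mu <- rep(1, n_genes)
  mu[((i - 1) * 10 + 1):(i * 10)] <- 6  # marker block for type i
  mu
})
colnames(profiles) <- types

# Query: 20 noisy cells per true type (genes in rows, cells in columns)
truth <- rep(types, each = 20)
query <- sapply(truth, function(ct) profiles[, ct] + rnorm(n_genes, sd = 0.5))

# Simplified annotator: pick the reference profile with the highest correlation
annotate <- function(query, ref_profiles) {
  cors <- cor(query, ref_profiles, method = "spearman")  # cells x labels
  colnames(ref_profiles)[max.col(cors, ties.method = "first")]
}

full_labels    <- annotate(query, profiles)               # Scenario 1: A, B, C
reduced_labels <- annotate(query, profiles[, c("A", "B")]) # Scenario 2: C removed

table(truth, full_labels)
table(truth, reduced_labels)
```

In the second table every true C cell falls into the A or B column, which is exactly the kind of silent misannotation the diagnostics are meant to expose.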
Diagnostic Functions Used
detectAnomaly()
Identifies cells whose transcriptomic profiles deviate from the reference, using isolation forests on the selected principal components (the call used here is shown at the top of this page).
Returns: Anomaly scores and classification for each cell
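To make the idea concrete, the following sketch fits an isolation forest on reference coordinates and scores query cells against it, using the isotree package directly. It illustrates the general technique only and makes no claim about how detectAnomaly() is implemented internally. The two-dimensional "PC scores" are simulated, and the 0.6 cutoff is simply the threshold quoted in the results on this page.

```r
library(isotree)

set.seed(123)

# Hypothetical 2-D "PC scores" for the reference cells of one cell type
ref_pcs <- matrix(rnorm(400), ncol = 2, dimnames = list(NULL, c("PC1", "PC2")))

# Hypothetical query: 80 cells matching the reference, 20 from a shifted population
query_pcs <- rbind(
  matrix(rnorm(160), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)
colnames(query_pcs) <- c("PC1", "PC2")
is_shifted <- rep(c(FALSE, TRUE), times = c(80, 20))

# Fit the forest on the reference only, then score the query cells
forest <- isolation.forest(ref_pcs, ntrees = 1000, ndim = 1,
                           seed = 1, nthreads = 1)
scores <- predict(forest, query_pcs)  # standardized anomaly scores

# Flag cells above the cutoff and compare the two query populations
threshold <- 0.6
flagged <- scores > threshold
tapply(scores, is_shifted, mean)
table(is_shifted, flagged)
```

Cells that lie far from the reference cloud are isolated after fewer random splits, so their scores are higher on average than those of cells inside it.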
plotCellTypePCA()
Projects query cells into reference PCA space to visualize distribution and separation:
plotCellTypePCA_custom(
  query_data = query_data,
  reference_data = reference_data,
  query_cell_type_col = "SingleR_annotation",
  ref_cell_type_col = "Cell_Type",
  pc_subset = 1:2,
  assay_name = "logcounts",
  diagonal_facet = "ridge",
  upper_facet = "blank",
  cell_type_colors = cell_type_colors
)

Returns: Pairwise PCA scatterplots with density distributions
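Projecting query cells into a reference PCA space means fitting the PCA on the reference alone and then applying the reference centering and loadings to the query. The base R sketch below shows that operation on hypothetical log-expression matrices. It is an illustration of the projection step, not the package's plotting code.

```r
set.seed(7)

# Hypothetical log-expression matrices (cells in rows, shared genes in columns)
genes <- paste0("gene", 1:50)
ref_expr <- matrix(rnorm(200 * 50), nrow = 200, dimnames = list(NULL, genes))
ref_expr[1:100, 1:10] <- ref_expr[1:100, 1:10] + 3      # two reference populations
query_expr <- matrix(rnorm(60 * 50), nrow = 60, dimnames = list(NULL, genes))
query_expr[1:30, 1:10] <- query_expr[1:30, 1:10] + 3

# Fit the PCA on the reference only
ref_pca <- prcomp(ref_expr, center = TRUE, scale. = FALSE)
ref_pcs <- ref_pca$x[, 1:2]

# predict() centers the query with the reference means and applies the
# reference loadings, so both data sets share one coordinate system
query_pcs <- predict(ref_pca, newdata = query_expr)[, 1:2]

# The same projection written out by hand
manual <- scale(query_expr, center = ref_pca$center, scale = FALSE) %*%
  ref_pca$rotation[, 1:2]

# Overlay query cells on the reference cloud
plot(ref_pcs, col = "grey60", pch = 16, xlab = "PC1", ylab = "PC2")
points(query_pcs, col = "firebrick", pch = 1)
```

Because the query never influences the loadings, a query population absent from the reference cannot create its own axis; it can only land somewhere in the space the reference defines.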
Analysis Results
Scenario 1: Baseline (All Cell Types Present)
PCA Projection (Figure A)
- Query cells (solid points) cluster tightly within reference cell type regions
- Clean separation between Cell Types A, B, C
- Diagonal panels show well-separated density distributions
- Perfect annotation expected
Anomaly Detection (Figures B & C)
- Cell Type A: Anomaly scores cluster near 0 (low anomaly)
- Cell Type B: Anomaly scores cluster near 0 (low anomaly)
- Density plots show sharp peaks below threshold (0.6)
- Very few cells flagged as anomalous (expected for well-matched data)
Interpretation: When the reference contains all query cell types, anomaly detection correctly reports that cells are normal. This is the negative control: no anomalies are expected, so very few cells should be flagged.
Scenario 2: Missing Cell Type
PCA Projection (Figure D)
- Query cells labeled as “Cell Type A” or “Cell Type B” actually include misannotated Cell Type C
- Misannotated cells visually separate from true A/B populations in PCA space
- Diagonal densities show bimodal or right-skewed distributions
- Clear visual evidence of heterogeneity within annotated groups
Anomaly Detection: Cell Type A (Figures E & F)
- Baseline (E): Anomaly scores peak near 0; clean distribution
- Missing Type (F): Anomaly scores show bimodal distribution with tail extending above 0.6
- Misannotated Cell Type C cells get flagged as anomalous
- Sensitivity: Successfully detects ~70-80% of true anomalies (misclassified C cells)
Anomaly Detection: Cell Type B (Figures G & H)
- Baseline (G): Anomaly scores peak near 0; clean distribution
- Missing Type (H): Anomaly scores show right shift; many cells exceed threshold
- Even stronger anomaly signal for Cell Type B (more of the misclassified C cells were assigned to B)
- Specificity: Minimal false positives among correctly annotated cells
Interpretation: When the reference lacks a cell type that is present in the query, anomaly detection successfully flags the misannotated cells. This demonstrates sensitivity to genuine transcriptomic discordance.
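Because the simulation provides ground truth, sensitivity and specificity can be computed directly by comparing the anomaly flags with the true cell types. The vectors below are hypothetical and serve only to show the arithmetic; they are not results from this analysis.

```r
# Hypothetical ground truth and anomaly flags for 10 query cells
true_type <- c("A", "A", "A", "C", "C", "B", "B", "C", "C", "B")
flagged   <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# True anomalies are the cells whose type is missing from the reference
is_anomaly <- true_type == "C"

# Sensitivity: fraction of true anomalies that were flagged
sensitivity <- sum(flagged & is_anomaly) / sum(is_anomaly)

# Specificity: fraction of non-anomalous cells that were left unflagged
specificity <- sum(!flagged & !is_anomaly) / sum(!is_anomaly)

c(sensitivity = sensitivity, specificity = specificity)
```

In this toy case three of four C cells are flagged (sensitivity 0.75) and five of six A/B cells are left alone (specificity of about 0.83).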
Key Validations
✓ Baseline control works:
- Identical cell types between reference and query produce minimal anomalies
- Diagnostic functions correctly recognize matching populations
✓ Anomaly detection is sensitive:
- Missing cell types detected with high sensitivity
- Misannotated cells clearly separated by anomaly scores
✓ False positives are low:
- True reference cell types show minimal false anomaly flags
- Specificity maintained even with missing populations
Design Insights
Why these controls matter:
- Baseline validates correctness — If anomaly detection flags normal cells as anomalous, the function isn’t working
- Missing type tests sensitivity — Real data often contains novel populations; we need to detect when reference is incomplete
- Controlled perturbations — Synthetic data lets us know ground truth, avoiding ambiguity in real data interpretation
What we learn for real data:
- Anomaly scores near 0 suggest cells match the reference well
- Anomaly scores well above 0.6 suggest genuine discordance
- Bimodal distributions suggest heterogeneous populations within a single annotation
- Density shifts like those between the two scenarios may also appear in real disease-vs-healthy comparisons
Next Steps
Apply validated diagnostics to real data:
- COVID-19 monocytes — Disease-driven expression changes
- MERFISH colitis — Spatial transcriptomics validation
- Explore tool diagnostics — Compare with annotation tool confidence metrics
Figures Generated
Baseline Scenario:
- cell_type_pca_original.png: PCA projection (all types present)
- cell_type_A_isoForest_original.png: Anomaly scores for Cell Type A
- cell_type_B_isoForest_original.png: Anomaly scores for Cell Type B
- cell_type_A_anomaly_density_original.png: Anomaly score distribution (A)
- cell_type_B_anomaly_density_original.png: Anomaly score distribution (B)
Missing Cell Type Scenario:
- cell_type_pca.png: PCA projection (Cell Type C absent from reference)
- cell_type_A_isoForest.png: Anomaly scores for Cell Type A (with misclassified C)
- cell_type_B_isoForest.png: Anomaly scores for Cell Type B (with misclassified C)
- cell_type_A_anomaly_density_missing.png: Anomaly score distribution (A, with C misclassified)
- cell_type_B_anomaly_density_missing.png: Anomaly score distribution (B, with C misclassified)