COVID-19 PBMC Analysis

Overview

This page demonstrates the application of scDiagnostics to the COVID-19 PBMC data, comparing single-cell measurements from severe COVID-19 patients to a healthy reference. We use a combination of project mapping in reduced dimensional space, anomaly detection, and differential expression analysis to illustrate how scDiagnostics can facilitate the detection and characterization of a disease-associate cell state that would be missed by standard reference-based annotation tools.

Dataset Context

Healthy Reference: 23,201 PBMC cells from 23 healthy patients
COVID-19 Query: 48,148 PBMC cells from 86 severe COVID-19 patients
Focus: Interferon (IFN) response genes and transcriptomic shifts during severe infection

Reproducibility Across Annotation Methods

All results below are reproducible with any annotation method.

Results shown use Azimuth L1 annotations (azimuth_celltype_l1_merged), but identical analyses can be generated by simply changing the query_cell_type_col parameter to:

"singler_annotations_merged" — SingleR correlation-based annotations
"celltypist_predicted_labels_merged" — CellTypist machine learning predictions
"scvi_prediction_merged" — scArches deep learning transfer

We consistently observe the same patterns of anomaly detection, ECM dysregulation, and elevated IFN gene expression across all four methods, confirming that results reflect genuine disease-driven transcriptomic remodeling rather than artifacts from individual annotationtools.

Analysis Components

1. UMAP Visualization (Figure A)

Script: R/covid/UMAP_Code_scVI.R

Three complementary UMAP visualizations of the joint reference-query space:

A) Disease Status UMAP (Left)

Healthy reference cells (blue) cluster tightly in the lower left region
COVID-19 query cells (pink) form a separate, larger cluster to the right
Clear spatial segregation indicates substantial transcriptomic remodeling during severe infection

B) Cell Type UMAP (Center)

Cell type annotations are preserved across both datasets
Reference landscape: B cells, Plasma cells, HSPCs, and diverse T/NK populations
Immune composition remains similar despite disease state

C) IFN Signature UMAP (Right)

Dramatic difference in interferon-stimulated gene expression
Reference cells: predominantly low IFN scores (dark purple)
Query cells: predominantly high IFN scores (yellow/orange)
This uniform, elevated IFN activation marks COVID-19 cells as a distinct transcriptomic state

Note: Alternative dimensionality reduction methods (Fast MNN and Harmony) are also available in UMAP_Code_FastMNN.R and UMAP_Code_Harmony.R.

2. PCA with IFN Signature (Figure B)

Script: R/covid/PCA_IFN_Plot.R

Pairwise scatterplot matrix of first 3 PCs with:

Diagonal: Density distributions for reference (blue) vs. query (purple)
Lower panels: PCA scatter plots colored by IFN signature score
- Reference cells (hollow circles) cluster tightly in negative PC space with low IFN scores
- Query cells (solid circles) scatter across positive PC space with elevated IFN scores
Upper panels: Blank for clarity

Left panel (Reference): Homogeneous reference population with minimal variation; IFN activation is baseline

Right panel (Query): Heterogeneous distribution of COVID-19 cells; IFN activation drives separation from reference in PC space

3. Anomaly Detection (Figure B, Right)

Script: R/covid/PCA_Anomaly_Plot.R

Identifies cells with transcriptomic profiles substantially different from the reference:

Anomalous cells (red): ~50% of COVID-19 cells with transcriptional patterns not well-represented in healthy reference
Non-anomalous cells (green/black): ~50% with relatively normal transcriptional structure
Reference cells (shown left) have minimal anomalies

Key finding: A substantial and systematic fraction of COVID-19 cells are flagged as anomalous, indicating genuine disease-driven transcriptomic remodeling beyond simple continuous variation.

Function Details: detectAnomaly()

anomaly_output <- detectAnomaly(
    query_data = covid_data,
    reference_data = normal_data,
    query_cell_type_col = "azimuth_celltype_l1_merged",
    ref_cell_type_col = "author_cell_type_merged",
    cell_types = c("CD14 mono"),
    pc_subset = 1:5,
    n_tree = 500,
    anomaly_threshold = 0.5,
    assay_name = "logcounts",
    max_cells_query = NULL,
    max_cells_ref = NULL
)

# Extract results for CD14 monocytes
query_anomaly_status <- anomaly_output[["CD14 mono"]][["query_anomaly"]]
reference_anomaly_status <- anomaly_output[["CD14 mono"]][["reference_anomaly"]]

Parameters: - query_data, reference_data — SCE objects to compare - query_cell_type_col, ref_cell_type_col — Column names containing cell type labels - cell_types — Specific cell type(s) to analyze - pc_subset — Principal components to use (1:5 = first 5 PCs) - n_tree — Number of trees for isolation forest (500 = balance speed/accuracy) - anomaly_threshold — Score threshold for anomaly classification (0.5 = 50th percentile)

Output: Logical vectors indicating which cells are anomalous in each dataset

Flexibility note: You can change query_cell_type_col to use different annotation systems: - "azimuth_celltype_l1_merged" (shown here) - "singler_annotations_merged" - "celltypist_predicted_labels_merged" - "scvi_prediction_merged"

We observe consistent anomaly detection patterns across all four annotation methods, confirming that results reflect genuine biological divergence rather than method-specific artifacts.

4. Gene Expression Shifts (Figure C)

Script: R/covid/Heatmap_and_Barplot_Code.R

Analyzes 25 interferon-stimulated genes to quantify transcriptomic shifts:

C) Heatmap (Left)

Rows: 25 interferon-stimulated genes (IFN signature)
Columns: Three comparisons (gray brackets at top)
- Query vs Reference (all)
- Query non-anomalous vs Reference
- Query anomalous vs Reference
Color: Blue (low) to Red (high expression)
Reference cells show uniformly low expression (blue)
Query cells show dramatically elevated expression (red)
Anomalous query cells show the most extreme upregulation

C) Barplot (Right)

Individual log2-fold changes for each IFN gene
Gray bars: All query vs reference (median ~0.8 log2-fold)
Red bars: Anomalous vs reference (median ~1.5 log2-fold)
Light red bars: Non-anomalous vs reference (median ~0.5 log2-fold)
Nearly all 25 genes show statistically significant upregulation

Key observations:

Uniform IFN activation: All 25 genes consistently upregulated; no selective subset
Anomaly stratification: Anomalous cells show 2-3 fold greater expression increase than non-anomalous cells
Strongest signals: LY6E, ISG15, and MX1 show the most dramatic fold changes (>1.5 log2-fold in anomalous)
Statistical significance: Paired t-tests show highly significant differences (p < 0.001) between anomalous and non-anomalous populations; Cohen’s d indicates large effect sizes

Function Details: calculateGeneShifts()

gene_shifts <- calculateGeneShifts(
    query_data = covid_data[yoshida_ifn_signature, ],
    reference_data = normal_data[yoshida_ifn_signature, ],
    query_cell_type_col = "azimuth_celltype_l1_merged",
    ref_cell_type_col = "author_cell_type_merged",
    cell_types = c("CD14 mono"),
    pc_subset = 1:5,
    n_top_loadings = 30,
    assay_name = "logcounts",
    p_value_threshold = 0.05,
    adjust_method = "fdr",
    detect_anomalies = TRUE,
    anomaly_comparison = TRUE,
    anomaly_threshold = 0.5,
    max_cells_query = 5000,
    max_cells_ref = 5000
)

# Plot results
plot(
    gene_shifts,
    cell_type = "CD14 mono",
    pc_subset = 1:5,
    plot_type = "heatmap",
    plot_by = "p_adjusted",
    n_genes = 25,
    significance_threshold = 0.05,
    show_anomalies = TRUE,
    pseudo_bulk = TRUE,
    max_cells_query = 5000,
    max_cells_ref = 5000
)

Parameters: - query_data, reference_data — SCE objects (subset to genes of interest) - query_cell_type_col, ref_cell_type_col — Cell type labels - cell_types — Specific cell type to analyze - pc_subset — Principal components determining gene importance - n_top_loadings — Top genes by PC loadings to consider - detect_anomalies — Include anomaly detection (TRUE/FALSE) - anomaly_comparison — Separate results by anomaly status (TRUE/FALSE) - max_cells_* — Subsample for computational efficiency

Output: Gene shift statistics with p-values, fold changes, and anomaly-stratified comparisons

Flexibility note: All four annotation systems show similar gene shift patterns. Changing query_cell_type_col to any of the merged columns produces consistent results, validating that observed transcriptomic shifts are robust biological signals independent of annotation method.

Interpretation

COVID-19 cells exhibit:

Spatial separation — Distinct clustering from healthy reference in embedding space
Uniform IFN activation — All 25 interferon genes co-regulated; reflects coordinated antiviral response
High anomaly rates — High % of cells flagged as transcriptomically divergent from reference
Stratified dysregulation — Anomalous cells show extreme IFN upregulation (1.5-2.0 log2-fold) vs. non-anomalous cells (~0.5 log2-fold)

Implications for annotation:

Distance-based methods may struggle with cells showing this degree of transcriptomic shift
Anomaly detection provides actionable signal — Identifies the most dysregulated cells independent of annotation tool
Built-in confidence scores may be insufficient — Uniform IFN activation suggests systematic, coordinated remodeling that confidence scores may underestimate
Biological signal is robust — Consistent across multiple diagnostic approaches (PCA, anomaly detection, gene shifts) and reproducible with all four annotation methods

Figures Generated

scvi_umap_disease.png — UMAP by disease status
scvi_umap_celltype.png — UMAP by cell type
scvi_umap_ifn.png — UMAP by IFN signature
pca_ifn.png — PCA pairs plot with IFN scores and anomaly detection
anomaly_plot.png — PCA anomaly detection results
covid_ifn_heatmap.png — Gene shift heatmap
gene_shifts_barplot.png — Gene shift barplot