Cell type annotation

Overview

This page describes how to predict cell type labels for the preprocessed data with four popular annotation tools (Azimuth, SingleR, CellTypist, and scArches). Each tool assigns cell types to the cells in the query dataset based on the cell type categories in the reference data.

All four annotation tools are applied to the same query and reference datasets, so their results can be assessed directly with the diagnostic functionality of scDiagnostics.

Workflow Summary

Irrespective of the annotation tool, the workflow always consists of the following steps:

  1. Data Preparation — Format data for the annotation tool
  2. Run Annotation — Apply the annotation method
  3. Results Integration — Add predictions back to your SCE object

For some methods (Azimuth, CellTypist, scArches), a preliminary step creates a reference or sets up the environment.


Method 1: Azimuth (R-based)

What it does: Performs reference-based mapping using weighted nearest neighbors after dimensionality reduction.

Step 1a: Create Azimuth Reference (Run Once)

Before annotating query data, you must create an Azimuth reference from your healthy reference dataset.

Script: R/covid/Create_Azimuth_Reference.R

Key steps:

  • Loads healthy reference data
  • Runs SCTransform (standard normalization)
  • Computes PCA (50 dimensions)
  • Generates UMAP in reference space
  • Creates searchable Azimuth reference index

Key parameters:

  • 3,000 variable features (for reproducibility)
  • 50 PCA dimensions
  • 300 UMAP epochs (optimized for efficiency)
  • 100 k-neighbors for robustness

Output: Creates data/covid/Azimuth/custom_azimuth_reference/ with reference files and metadata.

Note: This step only needs to be run once per reference dataset.

Step 1b: Annotate Query Data

Scripts:

  • R/auxiliary/performAzimuthAnnotation.R — Core annotation functionality
  • R/covid/Add_R_Annotations.R — Main execution script

Key features:

  • Integrates query into reference PCA space
  • Weighted k-NN voting for label transfer
  • Fast and memory-efficient
  • Provides both L1 (broad) and L2 (fine-grained) annotations
  • Mapping quality scores included
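
The weighted k-NN voting idea listed above can be illustrated with a minimal Python sketch. This is a simplified illustration under stated assumptions, not Azimuth's actual implementation: the function name weighted_knn_vote, the inverse-distance weighting, and the toy 2-D coordinates are all hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_vote(query, ref_points, ref_labels, k=3):
    """Return (label, score) for one query cell via distance-weighted k-NN voting."""
    # Distance from the query to every reference cell in the reduced space.
    dists = sorted(
        ((math.dist(query, p), lab) for p, lab in zip(ref_points, ref_labels)),
        key=lambda pair: pair[0],
    )
    votes = defaultdict(float)
    for d, lab in dists[:k]:
        votes[lab] += 1.0 / (d + 1e-9)  # closer neighbors count more (assumed weighting)
    best = max(votes, key=votes.get)
    # Score = share of the total vote weight won by the best label.
    return best, votes[best] / sum(votes.values())

# Toy 2-D coordinates; a real reference PCA space would have 50 dimensions.
ref_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.2, 4.9)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]
label, score = weighted_knn_vote((0.1, 0.1), ref_points, ref_labels, k=3)
print(label, score)
```

A score near 1 means the nearest reference cells agree on the label; a lower score signals a query cell that sits between reference populations.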

Output columns added to SCE:

  • azimuth_celltype_l1 / azimuth_celltype_l2 — Predicted labels
  • azimuth_celltype_l1_merged / azimuth_celltype_l2_merged — Collapsed categories
  • azimuth_score_l1 / azimuth_score_l2 — Prediction confidence scores
  • azimuth_mapping_score — Overall mapping quality

Method 2: SingleR (R-based)

What it does: Correlates cell expression profiles against reference data to assign cell types.

Scripts:

  • R/auxiliary/performSingleRWithSubsampling.R — Core annotation function with smart subsampling
  • R/covid/Add_R_Annotations.R — Main execution script (also runs SingleR)

Key features:

  • Uses Spearman correlation computed over marker genes identified by differential expression between reference labels
  • Conservative subsampling: preserves rare cell types while sampling common types proportionally
  • Produces confidence scores via “delta scores”
  • No separate reference preparation needed
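
The conservative subsampling described above might look roughly like the following Python sketch. The actual logic lives in performSingleRWithSubsampling.R and is not shown on this page, so the function name, the fraction and min_cells parameters, and their values are assumptions for illustration only.

```python
import random
from collections import defaultdict

def subsample_preserving_rare(labels, fraction=0.1, min_cells=50, seed=0):
    """Return sorted indices of cells to keep.

    Cell types with at most `min_cells` cells are kept whole; larger types
    are sampled proportionally, but never below `min_cells`.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_type = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_type[lab].append(idx)

    keep = []
    for idxs in by_type.values():
        if len(idxs) <= min_cells:
            keep.extend(idxs)  # rare type: keep every cell
        else:
            n = max(min_cells, round(len(idxs) * fraction))
            keep.extend(rng.sample(idxs, n))
    return sorted(keep)

# Toy reference: one abundant type and one rare type.
labels = ["T cell"] * 1000 + ["pDC"] * 20
kept = subsample_preserving_rare(labels, fraction=0.1, min_cells=50)
print(len(kept))
```

With these toy settings the abundant type is reduced to a tenth of its cells while all cells of the rare type survive, which is the property the subsampling is meant to guarantee.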

Output columns added to SCE:

  • singler_annotations — Predicted cell type labels
  • singler_annotations_merged — Collapsed to broader categories
  • singler_scores — Confidence scores (delta scores)
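
As a rough illustration of a delta score, the Python sketch below takes the per-label correlation scores of one cell and reports how far the best label stands above the median of all labels. Which delta variant the workflow stores in singler_scores is not stated here, so the best-minus-median definition and the toy scores are assumptions.

```python
from statistics import median

def delta_score(scores):
    """scores: dict mapping label -> correlation score for a single cell.

    Returns (best_label, delta), where delta is the best score minus the
    median score across all labels (assumed definition).
    """
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label] - median(scores.values())

# Toy per-label scores for one query cell.
cell_scores = {"T cell": 0.82, "B cell": 0.40, "NK cell": 0.55}
best, delta = delta_score(cell_scores)
print(best, round(delta, 2))
```

A small delta means several labels fit the cell almost equally well, so the assignment deserves a closer look.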

Method 3: CellTypist (Python-based)

What it does: Uses a trained logistic regression classifier to predict cell types.

Scripts (in order):

  1. R/auxiliary/environmentSetupCellTypist.R — Sets up Python environment (conda)
  2. R/covid/CellTypist_Data_Preparation.R — Converts SCE to AnnData format (.h5ad)
  3. python/covid/CellTypist_Annotation.py — Trains classifier and annotates query cells (Python)
  4. R/covid/CellTypist_Results_Integration.R — Integrates predictions back into SCE

Workflow:

Step 1: Prepare data (R → h5ad)
Step 2: Train and annotate (Python)
Step 3: Integrate results (h5ad → R)

Key features:

  • Machine learning-based approach
  • Requires Python environment with scanpy and celltypist
  • Memory-efficient data handling (removes non-essential metadata)
  • Produces confidence scores for each prediction

Output columns added to SCE:

  • celltypist_predicted_labels — Predicted cell type
  • celltypist_predicted_labels_merged — Collapsed to broader categories
  • celltypist_conf_score — Classification confidence (0-1)
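
To show where a 0-1 classification confidence can come from, the Python sketch below converts raw per-class logistic regression decision scores into probabilities and reports the winning class. It assumes the confidence is the sigmoid probability of the top class; the function name and the toy decision scores are hypothetical, and the CellTypist documentation remains the authority on how its conf_score is defined.

```python
import math

def predict_with_confidence(decision_scores):
    """decision_scores: dict mapping label -> raw decision score for one cell.

    Returns (label, confidence), where confidence is the sigmoid of the
    winning class's decision score.
    """
    probs = {lab: 1.0 / (1.0 + math.exp(-s)) for lab, s in decision_scores.items()}
    label = max(probs, key=probs.get)
    return label, probs[label]

# Toy decision scores for a single query cell.
cell = {"T cell": 2.0, "B cell": -1.5, "Monocyte": -3.0}
pred, conf = predict_with_confidence(cell)
print(pred, round(conf, 3))
```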

Method 4: scArches (Python-based)

What it does: Uses variational autoencoders (VAE) to learn a shared latent space and transfers annotations via weighted k-NN in that space.

Scripts (in order):

  1. R/covid/scVI_Data_Preparation.R — Selects HVGs, converts to AnnData format
  2. python/covid/scVI_Annotation.py — Trains VAE on reference, maps query, performs k-NN transfer (Python)
  3. R/covid/scVI_Results_Integration.R — Integrates predictions and UMAP coordinates into SCE

Workflow:

Step 1: Select HVGs and prepare data (R → h5ad)
Step 2: Train VAE and transfer labels (Python)
Step 3: Integrate predictions and embeddings (h5ad → R)

Key features:

  • Deep learning approach (variational autoencoder)
  • Learns unified latent representation capturing both reference and query variation
  • Includes uncertainty estimates from k-NN voting
  • Also computes UMAP coordinates in joint space
  • More computationally intensive but captures complex patterns

Output columns added to SCE:

  • scvi_prediction — Predicted cell type from k-NN transfer
  • scvi_prediction_merged — Collapsed to broader categories
  • scvi_confidence — Uncertainty score (0-1, lower = more confident)

Reduced dimensions:

  • UMAP_scVI — 2D UMAP in joint reference-query latent space
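
The uncertainty estimate from k-NN voting can be sketched in Python as one minus the weighted vote share of the winning label, which matches the "lower = more confident" reading of scvi_confidence. The Gaussian kernel, the bandwidth choice, and the function name are assumptions for illustration, not the code in scVI_Annotation.py.

```python
import math
from collections import defaultdict

def knn_uncertainty(neighbor_labels, neighbor_dists):
    """Return (label, uncertainty) from a query cell's latent-space neighbors.

    Uncertainty is 1 minus the normalized vote weight of the winning label.
    """
    # Bandwidth from the root-mean-square neighbor distance (assumed choice);
    # fall back to 1.0 if all distances are zero.
    sigma = (sum(d * d for d in neighbor_dists) / len(neighbor_dists)) ** 0.5 or 1.0
    weights = [math.exp(-((d / sigma) ** 2)) for d in neighbor_dists]
    total = sum(weights)

    share = defaultdict(float)
    for lab, w in zip(neighbor_labels, weights):
        share[lab] += w / total
    best = max(share, key=share.get)
    return best, 1.0 - share[best]

# Toy neighbors of one query cell in the latent space.
label, uncertainty = knn_uncertainty(["CD4 T", "CD4 T", "CD8 T"], [0.5, 0.6, 1.4])
print(label, round(uncertainty, 3))
```

When every neighbor carries the same label the uncertainty is essentially zero; mixed neighborhoods push it toward one.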

Merged Annotations

All tools also provide merged versions of the predicted cell type labels, in which fine-grained assignments are collapsed into broader categories. These are added automatically by the addMergedCellTypes() function:

  • Aggregates related cell types (e.g., all T cell subtypes → “T cell”)
  • Dataset-specific mapping rules (COVID vs. MERFISH)
  • Enables comparison across methods at consistent granularity

Example:

  • Original: CD4_naive, CD4_CM, CD4_EM → Merged: CD4 T
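
The collapsing step amounts to a lookup table from fine-grained to broad labels. The Python sketch below is a hypothetical analogue of that idea, not the addMergedCellTypes() implementation: the MERGE_MAP dictionary contains only the example from this page, and passing unmapped labels through unchanged is an assumption.

```python
# Mapping taken from the example above; a real table would be dataset-specific.
MERGE_MAP = {
    "CD4_naive": "CD4 T",
    "CD4_CM": "CD4 T",
    "CD4_EM": "CD4 T",
}

def merge_labels(labels, mapping):
    """Collapse fine-grained labels; labels without a rule are kept as-is."""
    return [mapping.get(lab, lab) for lab in labels]

fine = ["CD4_naive", "CD4_EM", "B"]
merged = merge_labels(fine, MERGE_MAP)
print(merged)
```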

Performing cell type annotation

COVID-19 Data:

# Step 1: Create Azimuth reference (run once)
source("R/covid/Create Azimuth Reference.R")

# Step 2: Run Azimuth + SingleR (R only)
source("R/covid/Add R (Azimuth and SingleR) Annotations.R")

# Step 3: Run CellTypist (R + Python)
source("R/covid/CellTypist Annotation Pipeline.R")

# Step 4: Run scVI/scArches (R + Python)
source("R/covid/scVI Annotation Pipeline.R")

MERFISH Data:

Same scripts, located in R/merfish/ instead of R/covid/.

The only difference is that the MERFISH scripts use tier2 as the reference annotation column instead of author_cell_type.


Annotation Comparison

After running all four annotation tools, the resulting SCE object contains:

  • Predicted cell type labels (Azimuth L1, Azimuth L2, SingleR, CellTypist, scArches)
  • 5 merged annotation columns (collapsed categories; Azimuth contributes one each for L1 and L2)
  • 6 confidence/uncertainty score columns (including azimuth_mapping_score)
  • UMAP coordinates from scArches

This can be used to:

  • Explore agreements and disagreements between tools
  • Interrogate annotation confidence
  • Perform in-depth assessment of annotation ambiguities with scDiagnostics
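
A first look at agreements and disagreements can be sketched in Python as pairwise agreement rates plus a per-cell majority vote over the merged label columns. The column names follow the outputs listed on this page; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: dict tool -> list of labels (same cell order for all tools)."""
    out = {}
    for a, b in combinations(sorted(annotations), 2):
        pairs = list(zip(annotations[a], annotations[b]))
        out[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return out

def consensus(annotations):
    """Per cell, return (most common label, fraction of tools voting for it)."""
    tools = sorted(annotations)
    result = []
    for cell_labels in zip(*(annotations[t] for t in tools)):
        label, n = Counter(cell_labels).most_common(1)[0]
        result.append((label, n / len(tools)))
    return result

# Toy merged labels for four cells.
annotations = {
    "singler_annotations_merged": ["CD4 T", "B", "NK", "CD4 T"],
    "celltypist_predicted_labels_merged": ["CD4 T", "B", "CD8 T", "CD4 T"],
    "scvi_prediction_merged": ["CD4 T", "Mono", "NK", "CD4 T"],
}
agreement = pairwise_agreement(annotations)
votes = consensus(annotations)
print(agreement)
print(votes)
```

Cells whose consensus fraction falls below one are natural candidates for the in-depth checks in scDiagnostics.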

Next Steps

Once annotations are added, proceed to:

  • scDiagnostics Overview — Learn about core diagnostic functionality of scDiagnostics
  • Results (COVID-19 and MERFISH) — Perform diagnostic assessments of cell type annotation for two real-world single-cell datasets