Overview
This page describes how to predict cell type labels for the preprocessed data with four popular annotation tools (Azimuth, SingleR, CellTypist, and scArches). The workflow uses each tool to assign cells in the query dataset to the cell type categories defined in the reference data.
All four annotation tools are applied to the same query and reference datasets, so their results can be directly assessed with the diagnostic functionality of scDiagnostics.
Workflow Summary
Irrespective of the annotation tool, the workflow always consists of the following steps:
- Data Preparation — Format data for the annotation tool
- Run Annotation — Apply the annotation method
- Results Integration — Add predictions back to your SCE object
For some methods (Azimuth, CellTypist, scArches), a preliminary step creates a reference or sets up the environment.
Method 1: Azimuth (R-based)
What it does: Performs reference-based mapping using weighted nearest neighbors after dimensionality reduction.
Step 1a: Create Azimuth Reference (Run Once)
Before annotating query data, you must create an Azimuth reference from your healthy reference dataset.
Script: R/covid/Create_Azimuth_Reference.R
Key steps:
- Loads healthy reference data
- Runs SCTransform (standard normalization)
- Computes PCA (50 dimensions)
- Generates UMAP in reference space
- Creates searchable Azimuth reference index
Key parameters:
- 3,000 variable features (for reproducibility)
- 50 PCA dimensions
- 300 UMAP epochs (optimized for efficiency)
- 100 k-neighbors for robustness
Output: Creates data/covid/Azimuth/custom_azimuth_reference/ with reference files and metadata.
Note: This step only needs to be run once per reference dataset.
Step 1b: Annotate Query Data
Scripts:
R/auxiliary/performAzimuthAnnotation.R — Core annotation functionality
R/covid/Add_R_Annotations.R — Main execution script
Key features:
- Integrates query into reference PCA space
- Weighted k-NN voting for label transfer
- Fast and memory-efficient
- Provides both L1 (broad) and L2 (fine-grained) annotations
- Mapping quality scores included
Output columns added to SCE:
azimuth_celltype_l1 / azimuth_celltype_l2 — Predicted labels
azimuth_celltype_l1_merged / azimuth_celltype_l2_merged — Collapsed categories
azimuth_score_l1 / azimuth_score_l2 — Prediction confidence scores
azimuth_mapping_score — Overall mapping quality
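The idea of placing query cells into the reference PCA space can be illustrated with a small sketch. This is not Azimuth's implementation (the actual mapping is performed in R by the scripts listed above); it only shows, on random toy data, how principal axes fitted on a reference can be reused to project query cells. The function names `fit_reference_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_reference_pca(ref, n_components):
    """Fit PCA on reference cells (rows = cells, columns = genes)."""
    mean = ref.mean(axis=0)
    # SVD of the centered matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    loadings = vt[:n_components].T  # genes x components
    return mean, loadings

def project(cells, mean, loadings):
    """Project cells into the PCA space defined by the reference."""
    return (cells - mean) @ loadings

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 30))  # toy data: 200 cells x 30 genes
query = rng.normal(size=(50, 30))       # toy data: 50 cells, same genes

mean, loadings = fit_reference_pca(reference, n_components=5)
ref_pcs = project(reference, mean, loadings)
query_pcs = project(query, mean, loadings)  # query uses the reference mean and axes
```

Because the query is centered with the reference mean and multiplied by the reference loadings, both datasets end up in the same coordinate system, which is what makes nearest-neighbor label transfer between them meaningful.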
Method 2: SingleR (R-based)
What it does: Correlates cell expression profiles against reference data to assign cell types.
Scripts:
R/auxiliary/performSingleRWithSubsampling.R — Core annotation function with smart subsampling
R/covid/Add_R_Annotations.R — Main execution script (also runs SingleR)
Key features:
- Uses Spearman correlation computed on marker genes identified by differential expression between reference labels
- Conservative subsampling: preserves rare cell types while sampling common types proportionally
- Produces confidence scores via “delta scores”
- No separate reference preparation needed
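One way to read the "conservative subsampling" idea is sketched below. It is an illustration under stated assumptions, not the logic of performSingleRWithSubsampling.R: the function name, the `fraction` and `min_cells` parameters, and the toy labels are all hypothetical. Cell types with few cells are kept whole, and larger types are sampled proportionally with a lower bound.

```python
import random
from collections import defaultdict

def subsample_preserving_rare(labels, fraction, min_cells, seed=0):
    """Return sorted indices of the cells to keep.

    Cell types with at most `min_cells` cells are kept whole; larger types
    are sampled at `fraction`, but never below `min_cells` cells.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    keep = []
    for indices in by_type.values():
        if len(indices) <= min_cells:
            keep.extend(indices)  # rare type: keep every cell
        else:
            n = max(min_cells, round(len(indices) * fraction))
            keep.extend(rng.sample(indices, n))
    return sorted(keep)

# Toy label vector (hypothetical cell type names and counts).
labels = ["T cell"] * 1000 + ["B cell"] * 200 + ["pDC"] * 8
kept = subsample_preserving_rare(labels, fraction=0.1, min_cells=20)
counts = {t: sum(labels[i] == t for i in kept) for t in set(labels)}
print(counts)
```

In this toy run the 8 pDC cells all survive, while the T cells are reduced to 100 and the B cells to 20.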
Output columns added to SCE:
singler_annotations — Predicted cell type labels
singler_annotations_merged — Collapsed to broader categories
singler_scores — Confidence scores (delta scores)
Method 3: CellTypist (Python-based)
What it does: Uses a trained logistic regression classifier to predict cell types.
Scripts (in order):
R/auxiliary/environmentSetupCellTypist.R — Sets up Python environment (conda)
R/covid/CellTypist_Data_Preparation.R — Converts SCE to AnnData format (.h5ad)
python/covid/CellTypist_Annotation.py — Trains classifier and annotates query cells (Python)
R/covid/CellTypist_Results_Integration.R — Integrates predictions back into SCE
Workflow:
Step 1: Prepare data (R → h5ad)
Step 2: Train and annotate (Python)
Step 3: Integrate results (h5ad → R)
Key features:
- Machine learning-based approach
- Requires Python environment with scanpy and celltypist
- Memory-efficient data handling (removes non-essential metadata)
- Produces confidence scores for each prediction
Output columns added to SCE:
celltypist_predicted_labels — Predicted cell type
celltypist_predicted_labels_merged — Collapsed to broader categories
celltypist_conf_score — Classification confidence (0-1)
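To make the classifier idea concrete, the sketch below trains a small one-vs-rest logistic regression with plain gradient descent and reports, for each query cell, the most probable class and its probability as a 0-1 confidence. It is a toy stand-in written with NumPy, not the CellTypist package or python/covid/CellTypist_Annotation.py; the function names, learning rate, epoch count, and data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_rest(x, y, n_classes, lr=0.1, epochs=500):
    """Fit one logistic regression per class with plain gradient descent."""
    n_cells, n_genes = x.shape
    weights = np.zeros((n_genes, n_classes))
    bias = np.zeros(n_classes)
    targets = np.eye(n_classes)[y]  # one-hot matrix, cells x classes
    for _ in range(epochs):
        error = sigmoid(x @ weights + bias) - targets
        weights -= lr * (x.T @ error) / n_cells
        bias -= lr * error.mean(axis=0)
    return weights, bias

def predict(x, weights, bias):
    """Return the most probable class and its probability for each cell."""
    prob = sigmoid(x @ weights + bias)
    return prob.argmax(axis=1), prob.max(axis=1)

rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])  # toy class centers
ref_y = np.repeat(np.arange(3), 50)
ref_x = centers[ref_y] + rng.normal(scale=0.5, size=(150, 2))

weights, bias = train_one_vs_rest(ref_x, ref_y, n_classes=3)
pred, conf = predict(centers, weights, bias)  # query cells placed at the centers
```

The confidence here is simply the highest per-class probability, which mirrors the role of a 0-1 classification confidence column without claiming to reproduce how CellTypist computes it.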
Method 4: scArches (Python-based)
What it does: Uses variational autoencoders (VAE) to learn a shared latent space and transfers annotations via weighted k-NN in that space.
Scripts (in order):
R/covid/scVI_Data_Preparation.R — Selects HVGs, converts to AnnData format
python/covid/scVI_Annotation.py — Trains VAE on reference, maps query, performs k-NN transfer (Python)
R/covid/scVI_Results_Integration.R — Integrates predictions and UMAP coordinates into SCE
Workflow:
Step 1: Select HVGs and prepare data (R → h5ad)
Step 2: Train VAE and transfer labels (Python)
Step 3: Integrate predictions and embeddings (h5ad → R)
Key features:
- Deep learning approach (variational autoencoder)
- Learns unified latent representation capturing both reference and query variation
- Includes uncertainty estimates from k-NN voting
- Also computes UMAP coordinates in joint space
- More computationally intensive but captures complex patterns
Output columns added to SCE:
scvi_prediction — Predicted cell type from k-NN transfer
scvi_prediction_merged — Collapsed to broader categories
scvi_confidence — Uncertainty score (0-1, lower = more confident)
Reduced dimensions:
UMAP_scVI — 2D UMAP in joint reference-query latent space
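The label transfer step in the latent space can be sketched as a weighted k-NN vote. The example below assumes inverse-distance weights and defines uncertainty as one minus the weight share of the winning label, so lower values mean more confident calls. This is an illustrative simplification, not the code in python/covid/scVI_Annotation.py; the function name, the weighting scheme, and the toy coordinates are assumptions.

```python
import numpy as np

def weighted_knn_transfer(ref_latent, ref_labels, query_latent, k=3):
    """Transfer labels by inverse-distance weighted k-NN voting.

    Returns predicted labels and an uncertainty in [0, 1], defined as
    1 - (weight share of the winning label).
    """
    classes = sorted(set(ref_labels))
    codes = np.array([classes.index(label) for label in ref_labels])

    # Pairwise Euclidean distances, query cells x reference cells.
    diff = query_latent[:, None, :] - ref_latent[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = np.take_along_axis(dist, neighbors, axis=1)
    weights = 1.0 / (neighbor_dist + 1e-12)
    weights /= weights.sum(axis=1, keepdims=True)

    # Accumulate the weight of each neighbor into its label's column.
    votes = np.zeros((query_latent.shape[0], len(classes)))
    rows = np.arange(query_latent.shape[0])
    for j in range(k):
        votes[rows, codes[neighbors[:, j]]] += weights[:, j]

    predicted = [classes[i] for i in votes.argmax(axis=1)]
    uncertainty = 1.0 - votes.max(axis=1)
    return predicted, uncertainty

# Toy 2D latent coordinates with two hypothetical labels.
ref_latent = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
ref_labels = ["A", "A", "A", "B", "B", "B"]
query_latent = np.array([[0.2, 0.2], [5.5, 5.5], [2.8, 2.8]])

predicted, uncertainty = weighted_knn_transfer(ref_latent, ref_labels, query_latent)
```

The first two query cells sit inside one cluster and get an uncertainty of essentially zero, while the third lies between the clusters, draws neighbors from both labels, and receives a clearly higher uncertainty.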
Merged Annotations
All tools also provide merged versions of the predicted cell type labels, which collapse fine-grained assignments into broader categories. These are added automatically by the addMergedCellTypes() function:
- Aggregates related cell types (e.g., all T cell subtypes → “T cell”)
- Dataset-specific mapping rules (COVID vs. MERFISH)
- Enables comparison across methods at consistent granularity
Example:
- Original: CD4_naive, CD4_CM, CD4_EM → Merged: CD4 T
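The collapsing step amounts to a lookup from fine-grained to broad labels. The sketch below uses the mapping from the example above; it is not the addMergedCellTypes() implementation, and the `merge_cell_types` helper and the extra "NK" label are hypothetical.

```python
# Mapping rules taken from the example above; a real mapping would be dataset-specific.
MERGE_MAP = {
    "CD4_naive": "CD4 T",
    "CD4_CM": "CD4 T",
    "CD4_EM": "CD4 T",
}

def merge_cell_types(labels, merge_map):
    """Collapse fine-grained labels; labels without a rule are kept unchanged."""
    return [merge_map.get(label, label) for label in labels]

fine = ["CD4_naive", "CD4_EM", "NK", "CD4_CM"]
merged = merge_cell_types(fine, MERGE_MAP)
print(merged)  # ['CD4 T', 'CD4 T', 'NK', 'CD4 T']
```

Keeping unmapped labels unchanged is one possible design choice; an alternative would be to raise an error so that missing rules are noticed early.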
Annotation Comparison
After running all four annotation tools, the resulting SCE object contains:
- Fine-grained cell type labels (Azimuth L1, Azimuth L2, SingleR, CellTypist, scArches)
- 5 merged annotation columns (collapsed categories, one per label column listed above)
- 5 confidence/uncertainty score columns
- UMAP coordinates from scArches
This can be used to:
- Explore agreements and disagreements between tools
- Interrogate annotation confidence
- Perform in-depth assessment of annotation ambiguities with scDiagnostics
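A first look at agreement between tools could be as simple as the sketch below, which compares merged label columns pairwise and lists the cells on which the tools disagree. It assumes the columns have been exported from the SCE object as plain lists; the helper names and the toy labels are hypothetical, and this is not part of scDiagnostics.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Fraction of cells with identical labels for each pair of tools."""
    result = {}
    for a, b in combinations(sorted(annotations), 2):
        pairs = list(zip(annotations[a], annotations[b]))
        result[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return result

def disagreeing_cells(annotations):
    """Indices of cells on which the tools do not all agree."""
    columns = list(annotations.values())
    return [i for i, labels in enumerate(zip(*columns)) if len(set(labels)) > 1]

# Toy merged labels for four cells (values are made up).
annotations = {
    "singler_annotations_merged": ["CD4 T", "B", "NK", "CD4 T"],
    "celltypist_predicted_labels_merged": ["CD4 T", "B", "CD4 T", "CD4 T"],
    "scvi_prediction_merged": ["CD4 T", "B", "NK", "B"],
}

agreement = pairwise_agreement(annotations)
ambiguous = disagreeing_cells(annotations)
print(ambiguous)  # [2, 3]
```

Comparing the merged columns rather than the fine-grained ones keeps the labels at a consistent granularity across tools.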
Next Steps
Once annotations are added, proceed to:
- scDiagnostics Overview — Learn about the core diagnostic functionality of scDiagnostics
- Results (one page per dataset) — Perform diagnostic assessments of cell type annotation for two real-world single-cell datasets