Cell type annotation

Overview

This page describes how to predict cell type labels for the preprocessed data with four popular annotation tools: Azimuth, SingleR, CellTypist, and scArches. Each tool assigns cell types to the cells in the query dataset based on the cell type categories in the reference data.

All four annotation tools are applied to the same query and reference datasets, so their predictions can be assessed directly with the diagnostic functionality from scDiagnostics.

Workflow Summary

Irrespective of the annotation tool, the workflow always consists of the following steps:

Data Preparation — Format data for the annotation tool
Run Annotation — Apply the annotation method
Results Integration — Add predictions back to your SCE object

For some methods (Azimuth, CellTypist, scArches), a preliminary step creates a reference or sets up the environment.

Method 1: `Azimuth` (R-based)

What it does: Performs reference-based mapping using weighted nearest neighbors after dimensionality reduction.

Step 1a: Create `Azimuth` Reference (Run Once)

Before annotating query data, you must create an Azimuth reference from your healthy reference dataset.

Script: R/covid/Create_Azimuth_Reference.R

Key steps:

Loads healthy reference data
Runs SCTransform (standard normalization)
Computes PCA (50 dimensions)
Generates UMAP in reference space
Creates searchable Azimuth reference index

Key parameters:

3,000 variable features (for reproducibility)
50 PCA dimensions
300 UMAP epochs (optimized for efficiency)
100 k-neighbors for robustness

Output: Creates data/covid/Azimuth/custom_azimuth_reference/ with reference files and metadata.

Note: This step only needs to be run once per reference dataset.

Step 1b: Annotate Query Data

Scripts:

R/auxiliary/performAzimuthAnnotation.R — Core annotation functionality
R/covid/Add_R_Annotations.R — Main execution script

Key features:

Integrates query into reference PCA space
Weighted k-NN voting for label transfer
Fast and memory-efficient
Provides both L1 (broad) and L2 (fine-grained) annotations
Mapping quality scores included
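
The weighted k-NN voting idea listed above can be illustrated with a small, self-contained Python sketch. This is not the Azimuth implementation (which is in R and derives its weights differently); the function name `weighted_knn_vote`, the inverse-distance weighting, and the toy coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def weighted_knn_vote(query, ref_points, ref_labels, k=3, eps=1e-9):
    """Return (label, score) for one query cell by inverse-distance weighted voting.

    Hypothetical helper: `query` and `ref_points` are coordinates in a shared
    low-dimensional embedding (for example, the reference PCA space).
    """
    # Distance from the query cell to every reference cell, nearest first.
    distances = sorted(
        (dist(query, point), label) for point, label in zip(ref_points, ref_labels)
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        votes[label] += 1.0 / (d + eps)  # closer neighbors carry more weight
    total = sum(votes.values())
    best = max(votes, key=votes.get)
    # The score is the winning label's share of the total vote weight.
    return best, votes[best] / total


# Toy reference: two clusters in a 2D embedding (made-up coordinates).
ref_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]

label, score = weighted_knn_vote((0.3, 0.3), ref_points, ref_labels, k=3)
label_all, score_all = weighted_knn_vote((0.3, 0.3), ref_points, ref_labels, k=5)
```

With `k=3` all neighbors share one label, so the score is 1; with `k=5` the distant B cells contribute a small weight and the score drops slightly below 1.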

Output columns added to SCE:

azimuth_celltype_l1 / azimuth_celltype_l2 — Predicted labels
azimuth_celltype_l1_merged / azimuth_celltype_l2_merged — Collapsed categories
azimuth_score_l1 / azimuth_score_l2 — Prediction confidence scores
azimuth_mapping_score — Overall mapping quality

Method 2: `SingleR` (R-based)

What it does: Correlates cell expression profiles against reference data to assign cell types.

Scripts:

R/auxiliary/performSingleRWithSubsampling.R — Core annotation function with smart subsampling
R/covid/Add_R_Annotations.R — Main execution script (also runs SingleR)

Key features:

Uses Spearman correlation computed on marker genes identified by differential expression between reference labels
Conservative subsampling: preserves rare cell types while sampling common types proportionally
Produces confidence scores via “delta scores”
No separate reference preparation needed
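
The conservative subsampling described above might look like the following Python sketch. It is not the code in performSingleRWithSubsampling.R; the function name `subsample_preserving_rare` and the parameters `fraction` and `min_cells` are hypothetical, chosen only to show the principle that rare types are kept whole while common types are sampled proportionally.

```python
import random
from collections import defaultdict


def subsample_preserving_rare(labels, fraction=0.1, min_cells=50, seed=0):
    """Return sorted indices of the cells to keep.

    Hypothetical rule: a cell type with at most `min_cells` cells is kept in
    full; larger types are sampled at `fraction`, but never below `min_cells`.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    keep = []
    for indices in by_type.values():
        if len(indices) <= min_cells:
            keep.extend(indices)  # rare type: keep every cell
        else:
            n = max(min_cells, round(len(indices) * fraction))
            keep.extend(rng.sample(indices, n))
    return sorted(keep)


# Toy label vector with two common types and one rare type.
labels = ["T cell"] * 1000 + ["B cell"] * 200 + ["pDC"] * 12
kept = subsample_preserving_rare(labels, fraction=0.1, min_cells=50)
kept_counts = {t: sum(labels[i] == t for i in kept) for t in set(labels)}
```

In this toy case the 12 pDC cells all survive, the B cells are capped at the 50-cell floor, and the T cells are reduced to 10%.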

Output columns added to SCE:

singler_annotations — Predicted cell type labels
singler_annotations_merged — Collapsed to broader categories
singler_scores — Confidence scores (delta scores)
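
To make the delta scores concrete, here is a minimal Python sketch, assuming the per-label scores for one cell are already available. SingleR describes the delta as the gap between the assigned label's score and the median score across all labels, and it also reports the gap to the next-best label. The function name `delta_scores` and the score values are made up for illustration; which of the two deltas is stored in singler_scores is not stated on this page.

```python
from statistics import median


def delta_scores(scores):
    """Compute delta scores for one cell.

    `scores` maps each reference label to its score for this cell.
    Returns (best_label, delta_median, delta_next).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best = ranked[0]
    # Gap between the best score and the median over all labels.
    delta_median = best - median(scores.values())
    # Gap between the best and the runner-up (undefined with a single label).
    delta_next = best - ranked[1][1] if len(ranked) > 1 else float("nan")
    return best_label, delta_median, delta_next


# Made-up scores for a single cell.
cell_scores = {"CD4 T": 0.62, "CD8 T": 0.58, "B": 0.21, "Monocyte": 0.15}
best_label, d_med, d_next = delta_scores(cell_scores)
```

A small gap to the runner-up alongside a large gap to the median, as in this toy cell, suggests the cell is clearly a T cell but the CD4/CD8 distinction is ambiguous.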

Method 3: `CellTypist` (Python-based)

What it does: Uses a trained logistic regression classifier to predict cell types.

Scripts (in order):

R/auxiliary/environmentSetupCellTypist.R — Sets up Python environment (conda)
R/covid/CellTypist_Data_Preparation.R — Converts SCE to AnnData format (.h5ad)
python/covid/CellTypist_Annotation.py — Trains classifier and annotates query cells (Python)
R/covid/CellTypist_Results_Integration.R — Integrates predictions back into SCE

Workflow:

Step 1: Prepare data (R → h5ad)
Step 2: Train and annotate (Python)
Step 3: Integrate results (h5ad → R)

Key features:

Machine learning-based approach
Requires Python environment with scanpy and celltypist
Memory-efficient data handling (removes non-essential metadata)
Produces confidence scores for each prediction

Output columns added to SCE:

celltypist_predicted_labels — Predicted cell type
celltypist_predicted_labels_merged — Collapsed to broader categories
celltypist_conf_score — Classification confidence (0-1)
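
The following Python sketch shows how a confidence between 0 and 1 can arise from a logistic regression classifier. It assumes a one-vs-rest setup in which each label's raw decision score is passed through the logistic function and the highest probability is reported. This is a simplified illustration, not the CellTypist code, and the function names and decision scores are hypothetical.

```python
from math import exp


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    z = exp(x)
    return z / (1.0 + z)


def predict_with_confidence(decision_scores):
    """Return (label, confidence) from one-vs-rest decision scores.

    `decision_scores` maps each label to the raw linear score of its
    classifier for one cell.
    """
    probs = {label: sigmoid(score) for label, score in decision_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


# Made-up decision scores for a single cell.
predicted_label, conf = predict_with_confidence({"B": -3.0, "CD4 T": 2.0, "NK": -0.5})
```

Because each label is scored independently in this setup, the probabilities need not sum to 1, and a cell that matches no label well receives a low confidence for every label.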

Method 4: `scArches` (Python-based)

What it does: Uses variational autoencoders (VAE) to learn a shared latent space and transfers annotations via weighted k-NN in that space.

Scripts (in order):

R/covid/scVI_Data_Preparation.R — Selects HVGs, converts to AnnData format
python/covid/scVI_Annotation.py — Trains VAE on reference, maps query, performs k-NN transfer (Python)
R/covid/scVI_Results_Integration.R — Integrates predictions and UMAP coordinates into SCE

Workflow:

Step 1: Select HVGs and prepare data (R → h5ad)
Step 2: Train VAE and transfer labels (Python)
Step 3: Integrate predictions and embeddings (h5ad → R)

Key features:

Deep learning approach (variational autoencoder)
Learns unified latent representation capturing both reference and query variation
Includes uncertainty estimates from k-NN voting
Also computes UMAP coordinates in joint space
More computationally intensive but captures complex patterns

Output columns added to SCE:

scvi_prediction — Predicted cell type from k-NN transfer
scvi_prediction_merged — Collapsed to broader categories
scvi_confidence — Uncertainty score (0-1, lower = more confident)
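
Because scvi_confidence runs in the opposite direction to the other tools' scores (lower means more confident), it can help to flip it before comparing across tools. The Python sketch below is one way to do that; the helper names and the 0.5 cutoff are hypothetical and not part of the workflow scripts.

```python
def uncertainty_to_confidence(uncertainty):
    """Map an uncertainty in [0, 1] to a confidence in [0, 1] (higher = more confident)."""
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {uncertainty}")
    return 1.0 - uncertainty


def flag_uncertain_cells(uncertainties, max_uncertainty=0.5):
    """Return the indices of cells whose uncertainty exceeds the cutoff."""
    return [i for i, u in enumerate(uncertainties) if u > max_uncertainty]


# Made-up uncertainty values for four cells.
scvi_confidence = [0.0, 0.25, 0.75, 1.0]
confidence = [uncertainty_to_confidence(u) for u in scvi_confidence]
flagged = flag_uncertain_cells(scvi_confidence)
```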

Reduced dimensions:

UMAP_scVI — 2D UMAP in joint reference-query latent space

Merged Annotations

All tools also provide merged versions of the predicted cell type labels, in which fine-grained assignments are collapsed into broader categories. These are added automatically by the addMergedCellTypes() function:

Aggregates related cell types (e.g., all T cell subtypes → “T cell”)
Dataset-specific mapping rules (COVID vs. MERFISH)
Enables comparison across methods at consistent granularity

Example:

Original: CD4_naive, CD4_CM, CD4_EM → Merged: CD4 T
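
The collapsing step amounts to a lookup table from fine-grained to broad labels. The Python sketch below illustrates the idea using the example above; it is not the addMergedCellTypes() implementation (an R function), and `MERGE_MAP` and `merge_labels` are hypothetical names.

```python
# Hypothetical mapping built from the example above.
MERGE_MAP = {
    "CD4_naive": "CD4 T",
    "CD4_CM": "CD4 T",
    "CD4_EM": "CD4 T",
}


def merge_labels(labels, merge_map):
    """Collapse fine-grained labels; labels missing from the map pass through unchanged."""
    return [merge_map.get(label, label) for label in labels]


fine_labels = ["CD4_naive", "CD4_EM", "NK", "CD4_CM"]
merged_labels = merge_labels(fine_labels, MERGE_MAP)
```

Passing unmapped labels through unchanged is a design choice made for this sketch; a stricter variant could raise an error so that unexpected labels are noticed early.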

Performing cell type annotation

COVID-19 Data:

# Step 1: Create Azimuth reference (run once)
source("R/covid/Create Azimuth Reference.R")

# Step 2: Run Azimuth + SingleR (R only)
source("R/covid/Add R (Azimuth and SingleR) Annotations.R")

# Step 3: Run CellTypist (R + Python)
source("R/covid/CellTypist Annotation Pipeline.R")

# Step 4: Run scVI/scArches (R + Python)
source("R/covid/scVI Annotation Pipeline.R")

MERFISH Data:

Same scripts, located in R/merfish/ instead of R/covid/.

The only difference is that the MERFISH scripts use tier2 as the reference annotation column instead of author_cell_type.

Annotation Comparison

After running all four annotation tools, the resulting SCE object contains:

Fine-grained cell type labels (Azimuth L1, Azimuth L2, SingleR, CellTypist, scArches)
5 merged annotation columns (collapsed categories, one per label column)
5 confidence/uncertainty score columns, plus the overall Azimuth mapping score
UMAP coordinates from scArches

This can be used to:

Explore agreements and disagreements between tools
Interrogate annotation confidence
Perform in-depth assessment of annotation ambiguities with scDiagnostics
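
As a first look at agreements and disagreements, the merged label columns can be compared pairwise. The Python sketch below assumes the columns have been exported as plain lists in the same cell order; the function names and the toy labels are hypothetical, and only the column names come from this page.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Fraction of cells with identical labels for every pair of tools.

    `annotations` maps a column name to a list of labels, one per cell,
    with all lists in the same cell order.
    """
    result = {}
    for a, b in combinations(sorted(annotations), 2):
        pairs = list(zip(annotations[a], annotations[b]))
        result[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return result


def unanimous(annotations):
    """Per-cell flag: True where every tool assigned the same label."""
    return [len(set(labels)) == 1 for labels in zip(*annotations.values())]


# Toy merged labels for four cells (made-up values).
annotations = {
    "singler_annotations_merged": ["T", "T", "B", "NK"],
    "celltypist_predicted_labels_merged": ["T", "B", "B", "NK"],
    "scvi_prediction_merged": ["T", "T", "B", "T"],
}
agreement = pairwise_agreement(annotations)
consensus = unanimous(annotations)
```

Cells where `consensus` is False are natural candidates for a closer look with the confidence scores and with scDiagnostics.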

Next Steps

Once annotations are added, proceed to:

scDiagnostics Overview — Learn about core diagnostic functionality of scDiagnostics
Results (COVID-19 or MERFISH) — Perform diagnostic assessments of cell type annotation for two real-world single-cell datasets
