Orchestrating Large-Scale Single-Cell Analysis with Bioconductor: Key Points

Introduction to Bioconductor and the SingleCellExperiment class

The Bioconductor project provides open-source software packages for the comprehension of high-throughput biological data.
A SingleCellExperiment object is an extension of the SummarizedExperiment object.
SingleCellExperiment objects contain specialized data fields for storing data unique to single-cell analyses, such as the reducedDims field.

Exploratory data analysis and quality control

Empty droplets, i.e. droplets that do not contain intact cells and that capture only ambient or background RNA, should be removed prior to an analysis. The emptyDrops function from the DropletUtils package can be used to identify empty droplets.
Doublets, i.e. instances where two cells are captured in the same droplet, should also be removed prior to an analysis. The computeDoubletDensity and doubletThresholding functions from the scDblFinder package can be used to identify doublets.
Quality control (QC) uses metrics such as library size, number of expressed features, and mitochondrial read proportion, based on which low-quality cells can be detected and filtered out. Diagnostic plots of the chosen QC metrics are important to identify possible issues.
Normalization is required to account for systematic differences in sequencing coverage between libraries and to make measurements comparable between cells. Library size normalization is the most commonly used normalization strategy, and involves dividing all counts for each cell by a cell-specific scaling factor.
Feature selection aims at selecting genes that contain useful information about the biology of the system while removing genes that contain only random noise. The modelGeneVar function from the scran package models the per-gene variance, and the getTopHVGs function selects the most highly variable genes.
Dimensionality reduction aims at reducing the computational work and at obtaining less noisy and more interpretable results. PCA is a simple and effective linear dimensionality reduction technique that provides interpretable results for further analysis such as clustering of cells. Non-linear approaches such as UMAP and t-SNE can be useful for visualization, but the resulting representations should not be used in downstream analysis.
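
To make the library size normalization described above concrete, here is a minimal Python sketch. It is not the Bioconductor implementation: the function names, the toy count matrix, the scaling of size factors to a mean of 1, and the log2 transform with a pseudocount of 1 are all assumptions chosen for illustration. It also assumes that cells with zero total counts were already removed during QC.

```python
import math


def library_size_factors(counts):
    """Return one size factor per cell, scaled so the factors average to 1.

    counts is a list of per-cell lists of gene counts (cells x genes).
    Assumes every cell has a non-zero total count.
    """
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def log_normalize(counts, pseudocount=1.0):
    """Divide each cell's counts by its size factor, then log2-transform."""
    factors = library_size_factors(counts)
    return [
        [math.log2(count / factor + pseudocount) for count in cell]
        for cell, factor in zip(counts, factors)
    ]


# Hypothetical toy data: cell B has the same composition as cell A
# but was sequenced twice as deeply.
counts = [
    [10, 0, 30],  # cell A, library size 40
    [20, 0, 60],  # cell B, library size 80
    [5, 5, 10],   # cell C, library size 20
]

factors = library_size_factors(counts)
normalized = log_normalize(counts)
```

After normalization, cells A and B have matching expression values, because they differed only in sequencing depth.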

Cell type annotation

The two main approaches for cell type annotation are 1) manual annotation of clusters based on marker gene expression, and 2) computational annotation based on annotation transfer from reference datasets or marker gene set enrichment testing.
For manual annotation, cells are first clustered with unsupervised methods such as graph-based clustering followed by community detection algorithms such as Louvain or Leiden.
The clusterCells function from the scran package provides different algorithms that are commonly used for the clustering of scRNA-seq data.
Once clusters have been obtained, cell type labels are then manually assigned to cell clusters by matching cluster-specific upregulated marker genes with prior knowledge of cell-type markers.
The scoreMarkers function from the scran package can be used to find candidate marker genes for clusters of cells by ranking differential expression between pairs of clusters.
Computational annotation using published reference datasets or curated gene sets provides a fast, automated, and reproducible alternative to the manual annotation of cell clusters based on marker gene expression.
The SingleR package is a popular choice for reference-based annotation and assigns labels to cells based on the reference samples with the highest Spearman rank correlations.
The AUCell package provides an enrichment test to identify curated marker sets that are highly expressed in each cell.
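
The idea behind correlation-based label transfer can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the SingleR algorithm, which involves further steps; the reference profiles, gene values, and function names below are hypothetical. Each query cell receives the label of the reference profile with the highest Spearman rank correlation. The sketch assumes no profile is constant across genes, since the correlation would then be undefined.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def annotate(cell, reference):
    """Return the best-matching reference label and all per-label scores."""
    scores = {label: spearman(cell, profile) for label, profile in reference.items()}
    return max(scores, key=scores.get), scores


# Hypothetical reference profiles over four genes.
reference = {
    "T cell": [9.0, 8.0, 0.5, 0.1],
    "B cell": [0.2, 0.4, 7.0, 9.0],
}
query_cell = [5.0, 3.0, 0.0, 1.0]

label, scores = annotate(query_cell, reference)
```

Because the comparison uses ranks rather than raw values, it is insensitive to monotonic differences in scale between the query and the reference.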

Multi-sample analyses

Batch effects are systematic technical differences in the observed expression in cells measured in different experimental batches.
Computational removal of batch-to-batch variation with the correctExperiments function from the batchelor package allows us to combine data across multiple batches for a consolidated downstream analysis.
Differential expression (DE) analysis of replicated multi-condition scRNA-seq experiments is typically based on pseudo-bulk expression profiles, generated by summing counts for all cells with the same combination of label and sample.
The aggregateAcrossCells function from the scater package facilitates the creation of pseudo-bulk samples.
The pseudoBulkDGE function from the scran package can be used to detect significant changes in expression between conditions for pseudo-bulk samples consisting of cells of the same type.
Differential abundance (DA) analysis aims at identifying significant changes in cell type abundance across conditions.
DA analysis uses bulk DE methods such as edgeR and DESeq2, which provide suitable statistical models for count data in the presence of limited replication. The difference is that the counts are not of reads per gene, but of cells per label.
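
The pseudo-bulk aggregation described above can be illustrated with a short Python sketch. It is a conceptual stand-in, not the aggregateAcrossCells implementation; the function name and toy data are hypothetical. Counts are summed over all cells sharing the same combination of label and sample, and the number of cells in each combination is recorded as well, since those cell counts are the input to a DA analysis.

```python
from collections import defaultdict


def pseudo_bulk(counts, labels, samples):
    """Sum per-cell gene counts for each (label, sample) combination.

    Returns a pair of dictionaries keyed by (label, sample):
    summed gene counts (for DE) and number of cells (for DA).
    """
    sums = {}
    n_cells = defaultdict(int)
    for cell_counts, label, sample in zip(counts, labels, samples):
        key = (label, sample)
        if key not in sums:
            sums[key] = [0] * len(cell_counts)
        sums[key] = [total + count for total, count in zip(sums[key], cell_counts)]
        n_cells[key] += 1
    return sums, dict(n_cells)


# Hypothetical toy data: four cells, two genes.
counts = [[1, 0], [2, 3], [0, 4], [5, 1]]
labels = ["T", "T", "B", "T"]
samples = ["s1", "s1", "s1", "s2"]

sums, n_cells = pseudo_bulk(counts, labels, samples)
```

Each resulting profile then plays the role of one bulk sample in the downstream statistical model.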

Working with large data

Out-of-memory representations can be used to work with single-cell datasets that are too large to fit in memory.
Parallelization of calculations across genes or cells is an effective strategy for speeding up analysis of large single-cell datasets.
Fast approximations for nearest neighbor search and singular value decomposition can speed up essential steps of single-cell analysis with minimal loss of accuracy.
Converter functions between existing single-cell data formats enable analysis workflows that leverage complementary functionality from popular single-cell analysis ecosystems.
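
The principle behind out-of-memory representations is that only a small block of the data is held in memory at any time. The Python sketch below illustrates this with a plain CSV file standing in for a file-backed matrix; the file layout (one cell per row), the chunk size, and the function names are assumptions made for this example and do not correspond to any Bioconductor API.

```python
import csv
import os
import tempfile


def iter_chunks(path, chunk_size):
    """Yield lists of up to chunk_size rows (cells) of integer counts."""
    with open(path, newline="") as handle:
        chunk = []
        for row in csv.reader(handle):
            chunk.append([int(value) for value in row])
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:  # final, possibly shorter, chunk
            yield chunk


def gene_totals(path, chunk_size=2):
    """Accumulate per-gene totals while holding one chunk in memory at a time."""
    totals = None
    for chunk in iter_chunks(path, chunk_size):
        for cell in chunk:
            if totals is None:
                totals = [0] * len(cell)
            totals = [total + count for total, count in zip(totals, cell)]
    return totals


# Hypothetical toy matrix: five cells (rows) by three genes (columns).
rows = [[1, 0, 2], [0, 3, 1], [4, 0, 0], [2, 2, 2], [0, 1, 5]]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows(rows)
    path = tmp.name

try:
    totals = gene_totals(path, chunk_size=2)
    chunk_lengths = [len(chunk) for chunk in iter_chunks(path, 2)]
finally:
    os.remove(path)
```

Because each chunk is processed independently before the partial results are combined, the same pattern also lends itself to parallelization across cells.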

Accessing single-cell data from the Human Cell Atlas (HCA)

The CuratedAtlasQueryR package provides programmatic access to single-cell reference maps from the Human Cell Atlas.
The package provides functionality to query for cells of interest and to download them into a SingleCellExperiment object.