Introduction to Bioconductor and the SingleCellExperiment class
- The Bioconductor project provides open-source software packages for the comprehension of high-throughput biological data.
- A SingleCellExperiment object is an extension of the SummarizedExperiment object.
- SingleCellExperiment objects contain specialized data fields for storing data unique to single-cell analyses, such as the reducedDims field.
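As a minimal sketch of the two points above, assuming the SingleCellExperiment package is installed, the following builds an object from a simulated count matrix and stores a toy embedding in the reducedDims field. The count matrix, the gene and cell names, and the two-component PCA are made up purely for illustration.

```r
library(SingleCellExperiment)

# Simulated counts: 20 genes (rows) x 10 cells (columns); illustrative data only
set.seed(1)
counts <- matrix(
    rpois(200, lambda = 5), nrow = 20, ncol = 10,
    dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10))
)

# Construct the object with a single "counts" assay
sce <- SingleCellExperiment(assays = list(counts = counts))

# A toy 2-dimensional PCA of the cells, stored in the reducedDims field
pca <- prcomp(t(log1p(counts)), rank. = 2)$x
reducedDim(sce, "PCA") <- pca

# The object inherits from SummarizedExperiment and now carries the embedding
is(sce, "SummarizedExperiment")
reducedDimNames(sce)
```

Because the class extends SummarizedExperiment, accessors such as assay() and colData() work on it unchanged, while reducedDim() is specific to the single-cell class.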
Exploratory data analysis and quality control
- Empty droplets, i.e. droplets that do not contain intact cells and that capture only ambient or background RNA, should be removed prior to an analysis. The emptyDrops function from the DropletUtils package can be used to identify empty droplets.
- Doublets, i.e. instances where two cells are captured in the same droplet, should also be removed prior to an analysis. The computeDoubletDensity and doubletThresholding functions from the scDblFinder package can be used to identify doublets.
- Quality control (QC) uses metrics such as library size, number of expressed features, and mitochondrial read proportion, based on which low-quality cells can be detected and filtered out. Diagnostic plots of the chosen QC metrics are important to identify possible issues.
- Normalization is required to account for systematic differences in sequencing coverage between libraries and to make measurements comparable between cells. Library size normalization is the most commonly used normalization strategy, and involves dividing all counts for each cell by a cell-specific scaling factor.
- Feature selection aims at selecting genes that contain useful information about the biology of the system while removing genes that contain only random noise. Calculate per-gene variance with the modelGeneVar function and select highly-variable genes with getTopHVGs.
- Dimensionality reduction aims at reducing the computational work and at obtaining less noisy and more interpretable results. PCA is a simple and effective linear dimensionality reduction technique that provides interpretable results for further analysis such as clustering of cells. Non-linear approaches such as UMAP and t-SNE can be useful for visualization, but the resulting representations should not be used in downstream analysis.
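The QC metrics and the library size normalization described above can be sketched in base R. This is an illustration of the arithmetic only, not the implementation used by any particular package; the simulated counts and the assumption that mitochondrial genes carry an "MT-" prefix are made up for the example.

```r
# Simulated counts: 50 genes x 8 cells; the first 5 genes are labeled
# as mitochondrial via a hypothetical "MT-" prefix
set.seed(42)
gene_names <- c(paste0("MT-", 1:5), paste0("gene", 1:45))
counts <- matrix(
    rpois(50 * 8, lambda = 3), nrow = 50,
    dimnames = list(gene_names, paste0("cell", 1:8))
)

# Per-cell QC metrics
lib_size   <- colSums(counts)                      # total counts per cell
n_detected <- colSums(counts > 0)                  # number of expressed features
is_mito    <- grepl("^MT-", rownames(counts))
mito_prop  <- colSums(counts[is_mito, , drop = FALSE]) / lib_size

# Library size normalization: one scaling factor per cell,
# centered so that the factors average to 1
size_factors <- lib_size / mean(lib_size)
norm_counts  <- sweep(counts, 2, size_factors, "/")

# Log-transform with a pseudo-count of 1
log_norm <- log2(norm_counts + 1)
```

After scaling, every cell has the same total count (the mean library size), which is what makes expression values comparable between cells.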
Cell type annotation
- The two main approaches for cell type annotation are 1) manual annotation of clusters based on marker gene expression, and 2) computational annotation based on annotation transfer from reference datasets or marker gene set enrichment testing.
- For manual annotation, cells are first clustered with unsupervised methods such as graph-based clustering followed by community detection algorithms such as Louvain or Leiden.
- The clusterCells function from the scran package provides different algorithms that are commonly used for the clustering of scRNA-seq data.
- Once clusters have been obtained, cell type labels are then manually assigned to cell clusters by matching cluster-specific upregulated marker genes with prior knowledge of cell-type markers.
- The scoreMarkers function from the scran package can be used to find candidate marker genes for clusters of cells by ranking differential expression between pairs of clusters.
- Computational annotation using published reference datasets or curated gene sets provides a fast, automated, and reproducible alternative to the manual annotation of cell clusters based on marker gene expression.
- The SingleR package is a popular choice for reference-based annotation and assigns labels to cells based on the reference samples with the highest Spearman rank correlations.
- The AUCell package provides an enrichment test to identify curated marker sets that are highly expressed in each cell.
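The idea behind reference-based annotation by Spearman rank correlation can be sketched in base R. This is a deliberately simplified illustration, not SingleR's actual algorithm: it correlates each test cell with one made-up reference profile per cell type and picks the best-scoring label. The reference profiles, the label names, and the noise levels are all assumptions of the toy example.

```r
set.seed(7)
genes <- paste0("gene", 1:30)

# Hypothetical reference profiles: each cell type is high in its own 10 genes
ref <- cbind(
    Tcell = c(rep(10, 10), rep(1, 20)),
    Bcell = c(rep(1, 10), rep(10, 10), rep(1, 10)),
    Mono  = c(rep(1, 20), rep(10, 10))
) + matrix(runif(90), nrow = 30, ncol = 3)   # small jitter to break ties
rownames(ref) <- genes

# Test cells: noisy copies of reference profiles with known identity
truth <- c("Tcell", "Mono", "Bcell", "Tcell")
test  <- ref[, truth] + matrix(rnorm(30 * 4, sd = 0.5), nrow = 30)
colnames(test) <- paste0("cell", 1:4)

# Spearman correlation of every test cell (rows) with every reference (columns)
rho <- cor(test, ref, method = "spearman")

# Assign each cell the label of the most highly correlated reference profile
predicted <- colnames(ref)[max.col(rho, ties.method = "first")]
data.frame(cell = colnames(test), predicted = predicted, truth = truth)
```

Rank-based correlation only uses the ordering of genes within each profile, which makes this kind of comparison less sensitive to differences in scale between the test and reference data.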
Multi-sample analyses
- Batch effects are systematic technical differences in the observed expression in cells measured in different experimental batches.
- Computational removal of batch-to-batch variation with the correctExperiments function from the batchelor package allows us to combine data across multiple batches for a consolidated downstream analysis.
- Differential expression (DE) analysis of replicated multi-condition scRNA-seq experiments is typically based on pseudo-bulk expression profiles, generated by summing counts for all cells with the same combination of label and sample.
- The aggregateAcrossCells function from the scater package facilitates the creation of pseudo-bulk samples.
- The pseudoBulkDGE function from the scran package can be used to detect significant changes in expression between conditions for pseudo-bulk samples consisting of cells of the same type.
- Differential abundance (DA) analysis aims at identifying significant changes in cell type abundance across conditions.
- DA analysis uses bulk DE methods such as edgeR and DESeq2, which provide suitable statistical models for count data in the presence of limited replication. The only difference is that the counts are not of reads per gene, but of cells per label.
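The two count tables that feed the DE and DA analyses above can be sketched in base R. This shows only the aggregation step under made-up data (the labels "A" and "B" and the samples "s1" to "s3" are hypothetical); it is not how aggregateAcrossCells is implemented, and the statistical testing itself is left to the bulk methods named above.

```r
set.seed(3)
n_cells <- 60

# Simulated counts: 5 genes x 60 cells, with a label and a sample per cell
counts <- matrix(
    rpois(5 * n_cells, lambda = 2), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", seq_len(n_cells)))
)
label     <- sample(c("A", "B"), n_cells, replace = TRUE)
sample_id <- sample(c("s1", "s2", "s3"), n_cells, replace = TRUE)

# Pseudo-bulk for DE: sum counts over all cells sharing a label/sample combination
group       <- paste(label, sample_id, sep = "_")
pseudo_bulk <- t(rowsum(t(counts), group))   # genes x (label, sample) combinations

# Input for DA: number of cells per label in each sample
abundance <- table(label, sample_id)

pseudo_bulk
abundance
```

Both tables are count matrices with one column per sample, so they have the same shape as the input expected by bulk DE methods.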
Working with large data
- Out-of-memory representations can be used to work with single-cell datasets that are too large to fit in memory.
- Parallelization of calculations across genes or cells is an effective strategy for speeding up analysis of large single-cell datasets.
- Fast approximations for nearest neighbor search and singular value decomposition can speed up essential steps of single-cell analysis with minimal loss of accuracy.
- Converter functions between existing single-cell data formats enable analysis workflows that leverage complementary functionality from popular single-cell analysis ecosystems.
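The idea shared by out-of-memory processing and parallelization, namely splitting the computation into independent blocks of cells, can be sketched in base R. In this toy example an ordinary in-memory matrix stands in for a disk-backed one, and the block size of 250 cells is an arbitrary assumption.

```r
# Simulated counts: 100 genes x 1000 cells (stand-in for a much larger dataset)
set.seed(11)
counts <- matrix(rpois(100 * 1000, lambda = 1), nrow = 100)

# Split the cells (columns) into consecutive blocks
block_size <- 250
starts <- seq(1, ncol(counts), by = block_size)
blocks <- lapply(starts, function(s) s:min(s + block_size - 1, ncol(counts)))

# Compute the library size block by block; each block is independent,
# so only one block needs to be held in memory at a time
lib_size_blocks <- lapply(blocks, function(idx) {
    colSums(counts[, idx, drop = FALSE])
})

# Combine the per-block results into one vector
lib_size <- unlist(lib_size_blocks)
```

Since no block depends on another, the lapply call could in principle be replaced by a parallel equivalent to distribute the blocks across workers.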
Accessing data from the Human Cell Atlas (HCA)
- The CuratedAtlasQueryR package provides programmatic access to single-cell reference maps from the Human Cell Atlas.
- The package provides functionality to query for cells of interest and to download them into a SingleCellExperiment object.