Visualization of QC and Annotation Scores

Anthony Christidis

Core for Computational Biomedicine, Harvard Medical School

Andrew Ghazi

Core for Computational Biomedicine, Harvard Medical School

Smriti Chawla

Core for Computational Biomedicine, Harvard Medical School

Nitesh Turaga

Core for Computational Biomedicine, Harvard Medical School

Ludwig Geistlinger

Core for Computational Biomedicine, Harvard Medical School

Robert Gentleman

Core for Computational Biomedicine, Harvard Medical School
Source: vignettes/QCandAnnotationScores.Rmd

QCandAnnotationScores.Rmd

Introduction

In this vignette, we will demonstrate how to use functions for visualizing various aspects of single-cell RNA sequencing data. We will cover:

Scatter Plot of QC Stats vs Cell Type Annotation Scores
Histograms of QC Stats and Annotation Scores
Gene Set Scores on Dimensional Reduction Plots

Datasets

We will use two datasets in this vignette: query_data, and qc_data. Each dataset has been preprocessed to include log-normalized counts, specific metadata columns, PCA, t-SNE, and UMAP results.

Query Data

The query_data is another dataset from HeOrganAtlas, but it serves as a query set for comparison. It contains:

Log-Normalized Counts and Metadata: Similar to the reference data.
SingleR Annotations: Predictions of cell types using SingleR, including SingleR_annotation and annotation_scores.
Gene Set Scores: Scores calculated for gene sets or pathways, providing insights into gene activities across cells.

QC Data

The qc_data originates from the Bunis et al. study focusing on haematopoietic stem and progenitor cells. This dataset includes:

QC Metrics: Metrics such as total library size and mitochondrial gene content.
SingleR Predictions: Predicted cell types based on SingleR annotations.
Annotation Scores: Scores reflecting the confidence in cell type predictions.

# Load library
library(scDiagnostics)

# Load datasets
data("qc_data")
data("query_data")

# Set seed for reproducibility
set.seed(0)

QC Scores

Scatter Plot: QC Stats vs Cell Type Annotation Scores

The scatter plot visualizes the relationship between QC statistics (e.g., total library size or percentage of mitochondrial genes) and cell type annotation scores. This plot helps in understanding how QC metrics influence or correlate with the predicted cell types.

# Generate scatter plot
p1 <- plotQCvsAnnotation(se_object = qc_data,
                         cell_type_col = "SingleR_annotation",
                         qc_col = "total",
                         score_col = "annotation_scores")
p1 + ggplot2::xlab("Library Size")

A scatter plot can reveal patterns such as whether cells with higher library sizes or mitochondrial content tend to be associated with specific annotations. For instance, cells with unusually high mitochondrial content might be identified as low-quality or stressed, potentially affecting their annotations.

Histograms: QC Stats and Annotation Scores Visualization

Histograms provide a distribution view of QC metrics and annotation scores. They help in evaluating the range, central tendency, and spread of these variables across cells.

# Generate histograms
histQCvsAnnotation(se_object = query_data, 
                   cell_type_col = "SingleR_annotation", 
                   qc_col = "percent_mito", 
                   score_col = "annotation_scores")

Histograms are useful for assessing the overall distribution of QC metrics and annotation scores. For example, if the majority of cells have high percent_mito, it might indicate that many cells are stressed or dying, which could impact the quality of the data.

Visualization of Gene Sets or Pathway Scores on Dimensional Reduction Plots

Dimensional reduction plots (PCA, t-SNE, UMAP) are used to visualize the relationships between cells in reduced dimensions. Overlaying gene set scores on these plots provides insights into how specific gene activities are distributed across cell clusters.

# Plot gene set scores on PCA
plotGeneSetScores(se_object = query_data, 
                  method = "PCA", 
                  score_col = "gene_set_scores", 
                  pc_subset = 1:3)

By visualizing gene set scores on PCA or UMAP plots, one can identify clusters of cells with high or low gene set activities. This can help in understanding the biological relevance of different gene sets or pathways in various cell states or types.

Conclusion

This vignette illustrates how to visualize and interpret QC statistics, cell type annotation scores, and gene set scores using single-cell RNA sequencing data. These visualizations are crucial for assessing data quality, understanding cell type annotations, and exploring gene activities, ultimately aiding in the comprehensive analysis of single-cell datasets.

R Session Info

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scDiagnostics_1.0.0 BiocStyle_2.32.1   

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.34.0 gtable_0.3.5               
 [3] xfun_0.47                   bslib_0.8.0                
 [5] ggplot2_3.5.1               htmlwidgets_1.6.4          
 [7] Biobase_2.64.0              lattice_0.22-6             
 [9] vctrs_0.6.5                 tools_4.4.1                
[11] generics_0.1.3              stats4_4.4.1               
[13] tibble_3.2.1                fansi_1.0.6                
[15] pkgconfig_2.0.3             Matrix_1.7-0               
[17] desc_1.4.3                  S4Vectors_0.42.1           
[19] lifecycle_1.0.4             GenomeInfoDbData_1.2.12    
[21] farver_2.1.2                compiler_4.4.1             
[23] textshaping_0.4.0           munsell_0.5.1              
[25] GenomeInfoDb_1.40.1         htmltools_0.5.8.1          
[27] sass_0.4.9                  yaml_2.3.10                
[29] pkgdown_2.1.1               pillar_1.9.0               
[31] crayon_1.5.3                jquerylib_0.1.4            
[33] SingleCellExperiment_1.26.0 DelayedArray_0.30.1        
[35] cachem_1.1.0                abind_1.4-8                
[37] tidyselect_1.2.1            digest_0.6.37              
[39] dplyr_1.1.4                 bookdown_0.40              
[41] labeling_0.4.3              fastmap_1.2.0              
[43] grid_4.4.1                  colorspace_2.1-1           
[45] cli_3.6.3                   SparseArray_1.4.8          
[47] magrittr_2.0.3              S4Arrays_1.4.1             
[49] utf8_1.2.4                  withr_3.0.1                
[51] UCSC.utils_1.0.0            scales_1.3.0               
[53] rmarkdown_2.28              XVector_0.44.0             
[55] httr_1.4.7                  matrixStats_1.4.1          
[57] ragg_1.3.3                  evaluate_1.0.0             
[59] knitr_1.48                  GenomicRanges_1.56.1       
[61] IRanges_2.38.1              rlang_1.1.4                
[63] glue_1.7.0                  BiocManager_1.30.25        
[65] BiocGenerics_0.50.0         jsonlite_1.8.9             
[67] R6_2.5.1                    MatrixGenerics_1.16.0      
[69] systemfonts_1.1.0           fs_1.6.4                   
[71] zlibbioc_1.50.0