Introduction

This vignette demonstrates the use plotCellTypeMDS(), plotCellTypePCA(), boxplotPCA() and calculateDiscriminantSpace(), which are designed to help assess and compare cell types between query and reference datasets using Multidimensional Scaling (MDS) and Principal Component Analysis (PCA), and Fisher Discriminant Analysis (FDA) respectively.

This vignette also demonstrates how to visualize gene expression data from single-cell RNA sequencing (scRNA-seq) experiments using two functions: plotGeneExpressionDimred() and plotMarkerExpression(). These functions allow researchers to explore gene expression patterns across various dimensions and compare expression distributions between different datasets or cell types.

Preliminaries

In the context of the scDiagnostics package, this vignette illustrates how to evaluate cell type annotations using two distinct datasets:

  • reference_data: This dataset consists of meticulously curated cell type annotations assigned by experts, providing a robust benchmark for assessing the quality of annotations. It serves as a standard reference to evaluate and detect anomalies or inconsistencies within the cell type annotations.

  • query_data: This dataset includes annotations both from expert assessments and those generated by the SingleR package. By comparing these annotations, you can identify discrepancies between manual and automated results, thereby pinpointing potential inconsistencies or areas that may need further scrutiny.

# Load library
library(scDiagnostics)

# Load datasets
data("reference_data")
data("query_data")

# Set seed for reproducibility
set.seed(0)

Visualization of Query vs. Reference Dataset

Plot Reference and Query Cell Types Using MDS

The plotCellTypeMDS() function creates a scatter plot using Multidimensional Scaling (MDS) to visualize the similarity between cell types in query and reference datasets. This function generates a MDS plot that colors cells according to their types based on a dissimilarity matrix calculated from a specified gene set.

# Generate the MDS scatter plot with cell type coloring
plotCellTypeMDS(query_data = query_data, 
                reference_data = reference_data, 
                cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                query_cell_type_col = "SingleR_annotation", 
                ref_cell_type_col = "expert_annotation")

Plot Principal Components for Different Cell Types

The plotCellTypePCA() function projects the query dataset onto the PCA space of the reference dataset. It then plots specified principal components for different cell types, allowing comparison of PCA results between query and reference datasets.

# Generate the MDS scatter plot with cell type coloring
plotCellTypePCA(query_data = query_data, 
                reference_data = reference_data,
                cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
                query_cell_type_col = "SingleR_annotation", 
                ref_cell_type_col = "expert_annotation", 
                pc_subset = 1:3)

Here, plotCellTypePCA() projects the query data into the PCA space of the reference data and plots the specified principal components. The pc_subset argument allows you to select which principal components to display.

Plot Principal Components as Boxplots

The boxplotPCA() function provides a boxplot visualization of principal components for different cell types across query and reference datasets. This function helps in comparing the distributions of PCA scores for specified cell types and datasets.

# Plot the PCA data as boxplots
boxplotPCA(query_data = query_data, 
           reference_data = reference_data,
           cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
           query_cell_type_col = "SingleR_annotation", 
           ref_cell_type_col = "expert_annotation", 
           pc_subset = 1:3)

In this example, boxplotPCA() generates boxplots for the specified principal components, showing the distribution of PCA scores for different cell types and datasets.

Project Query Data onto Discriminant Space of Reference Data

The calculateDiscriminantSpace() function projects query single-cell RNA-seq data onto a discriminant space defined by reference data. This process involves using the reference data to identify key variables, compute discriminant vectors, and project both the reference and query data onto this space. The function then assesses the similarity between the projections of the query data and the reference data using cosine similarity and Mahalanobis distance.

Function Details

The function performs the following steps for each pairwise combination of cell types:

  1. Identify Important Variables: Determine the most significant variables that distinguish the two cell types from the reference data.
  2. Compute Covariance Matrices: Estimate the covariance matrices for each cell type using the Ledoit-Wolf shrinkage method.
  3. Construct Scatter Matrices: Create within-class and between-class scatter matrices.
  4. Solve Eigenvalue Problem: Solve the generalized eigenvalue problem to obtain discriminant vectors.
  5. Project Data: Project both reference and query data onto the discriminant space.
  6. Assess Similarity: Measure the similarity of the query data projections to the reference projections using cosine similarity and Mahalanobis distance.

Example Application

First, applying the function to the SingleR annotation.

# Compute important variables for all pairwise cell comparisons
disc_output <- calculateDiscriminantSpace(
  reference_data = reference_data,
  query_data = query_data, 
  query_cell_type_col = "SingleR_annotation", 
  ref_cell_type_col = "expert_annotation")

# Generate scatter and boxplot
plot(disc_output, plot_type = "scatterplot")

Using Mahalanobis Distance for Anomaly Detection in Single-Cell RNA-Seq Data

We aim to evaluate whether a new batch of query cells fits well within established reference cell types or if there are cells exhibiting unusual characteristics. We will use the Mahalanobis distance calculated by the calculateDiscriminantSpace() function to achieve this.

# Perform discriminant analysis and projection
discriminant_results <- calculateDiscriminantSpace(
  reference_data = reference_data,
  query_data = query_data,
  ref_cell_type_col = "expert_annotation",
  query_cell_type_col = "SingleR_annotation",
  calculate_metrics = TRUE,  # Compute Mahalanobis distance and cosine similarity
  alpha = 0.01  # Significance level for Mahalanobis distance cutoff
)

You can extract and analyze the Mahalanobis distances of query cells relative to reference cell types. High Mahalanobis distances may indicate potential anomalies or novel cell states.

# Extract Mahalanobis distances
mahalanobis_distances <- discriminant_results$`CD4-CD8`$query_mahalanobis_dist

# Identify anomalies based on a threshold
threshold <- discriminant_results$`CD4-CD8`$mahalanobis_crit  # Use the computed cutoff value
anomalies <- mahalanobis_distances[mahalanobis_distances > threshold]

You can now create visualizations to illustrate the distribution of Mahalanobis distances and highlight potential anomalies.

# Load ggplot2 for visualization
library(ggplot2)

# Convert distances to a data frame
distances_df <- data.frame(Distance = mahalanobis_distances,
                           Anomaly = ifelse(mahalanobis_distances > threshold, "Anomaly", "Normal"))

# Plot histogram of Mahalanobis distances
ggplot(distances_df, aes(x = Distance, fill = Anomaly)) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.7) +
  labs(title = "Histogram of Mahalanobis Distances",
       x = "Mahalanobis Distance",
       y = "Frequency") +
  theme_bw()

  • Low Mahalanobis Distances: These query cells align well with the reference cell types, suggesting they are similar to the established cell types.

  • High Mahalanobis Distances: Cells with high distances may represent outliers or novel, previously uncharacterized cell types. Further investigation may be needed to understand these anomalies.

Project Data onto Sliced Inverse Regression (SIR) Space of Reference Data

The calculateSIRSpace() function leverages Sliced Inverse Regression (SIR) to analyze high-dimensional cell type data by focusing on the directions that best separate different cell types and their subtypes. The function projects data onto a lower-dimensional SIR space, aiming to highlight the most informative features for distinguishing between cell types.

The process starts by handling each cell type as a distinct slice of the data, allowing for a more detailed analysis. When multiple_cond_means is set to TRUE, the function refines this by computing conditional means for each cell type, using clustering to capture finer distinctions within each type. It then performs Singular Value Decomposition (SVD) on these refined means to uncover the principal directions that most effectively differentiate the cell types.

Ultimately, the function provides projections that reveal how cell types and their subtypes relate to each other in a reduced space, facilitating a deeper understanding of cellular variability and distinctions in complex datasets.

# Calculate the SIR space projections
sir_output <- calculateSIRSpace(
  reference_data = reference_data,
  query_data = query_data,
  query_cell_type_col = "SingleR_annotation",
  ref_cell_type_col = "expert_annotation",
  multiple_cond_means = TRUE
)

# Plot the SIR space projections
plot(sir_output, 
     plot_type = "boxplot", 
     sir_subset = 1:3)

In this example, we compute the SIR space projections using the calculateSIRSpace() function, and then we visualize the projections with a boxplot to analyze the distribution of the data in the SIR space. This example helps you understand how different cell types and their subtypes are projected into a common space, facilitating their comparison and interpretation.

The plot below visualizes the contribution of different markers to the variance along the SIR dimensions, helping identify which markers are most influential in differentiating e.g. between B cells (including plasma cells) from other cell types.

# Plot the top contributing markers
plot(sir_output, 
     plot_type = "varplot", 
     sir_subset = 1:3,
     n_top_vars = 10)

Visualization of Marker Expressions

Visualizing Gene Expression in Reduced Dimensions

The plotGeneExpressionDimred() function plots the expression of a specific gene on a dimensional reduction plot. This function is useful for exploring how gene expression varies across cells in the context of reduced dimensions, such as those derived from PCA, t-SNE, or UMAP. Each point on the plot represents a single cell, color-coded according to the expression level of the specified gene.

Let’s visualize the expression of the gene VPREB3 on a PCA plot:

# Visualize VPREB3 gene on a PCA plot
plotGeneExpressionDimred(se_object = query_data, 
                         method = "PCA", 
                         pc_subset = 1:3, 
                         feature = "VPREB3")

In this example, we visualize the expression of the VPREB3 gene on a PCA plot derived from the query_data dataset. The method argument specifies the dimensional reduction technique, which in this case is “PCA.” The pc_subset argument allows us to focus on specific principal components (PCs), here selecting the first five. The feature argument is the name of the gene we want to visualize.

When executed, this function generates a scatter plot with cells colored by their expression levels of VPREB3. The resulting plot can help identify clusters of cells with similar expression patterns or highlight how expression varies across the dataset.

You can also visualize gene expression using other dimensional reduction methods, such as t-SNE or UMAP.

Plotting Gene Expression Distribution

The plotMarkerExpression() function compares the distribution of a specific gene’s expression across an overall dataset and within specific cell types. It generates density plots to visualize these distributions, providing a graphical means of assessing the similarity between reference and query datasets. The function also applies a z-rank normalization to make expression values comparable between datasets.

plotMarkerExpression(
  reference_data = reference_data, 
  query_data = query_data, 
  ref_cell_type_col = "expert_annotation", 
  query_cell_type_col = "SingleR_annotation", 
  gene_name = "VPREB3", 
  cell_type = "B_and_plasma"
)

In this example, the function compares the expression distribution of the VPREB3 gene between the reference and query datasets. The ref_cell_type_col and query_cell_type_col arguments specify the columns in colData that identify the cell types in the reference and query datasets, respectively. The cell_type argument specifies the cell type of interest, and gene_name is the gene whose expression distribution we wish to compare.

When executed, this function generates two density plots: one showing the overall gene expression distribution across all cells and another focusing on the specified cell type. These plots help identify whether the gene’s expression levels are similar between the reference and query datasets, providing insight into dataset alignment.

Conclusion

In this vignette, we have explored various tools for analyzing and visualizing single-cell RNA-seq data to enhance our understanding of cell type annotations and data distributions. Each function demonstrated plays a critical role in this process:

  • plotCellTypeMDS(): Utilizing multidimensional scaling (MDS), this function provides a way to visualize the similarity between cell types in a reduced dimensional space. It is particularly useful for comparing the overall structure of the data and assessing how well cell types are separated based on their MDS coordinates.

  • visualizeCellTypePCA(): This function offers a comprehensive view of cell type-specific PCA results, allowing for the exploration of how different cell types are distributed in the PCA space. It helps in understanding the separation and clustering of cell types based on principal components, facilitating the identification of patterns and relationships between cell types.

  • boxplotPCA(): This function enables the visualization of principal component analysis (PCA) results through boxplots, providing insights into the distribution of PCA scores across different cell types. By examining these plots, researchers can assess variability and identify potential outliers or inconsistencies within the data.

  • calculateDiscriminantSpace(): This function projects query data onto a discriminant space defined by reference data, using discriminant analysis to identify important variables and compute projections. It also calculates similarity metrics, such as cosine similarity and Mahalanobis distance, to evaluate how closely query data aligns with reference data. This function is instrumental in assessing the accuracy of cell type annotations and detecting potential anomalies.

  • plotGeneExpressionDimred(): This function visualizes the expression of a specific gene on a dimensional reduction plot, such as those generated by PCA, t-SNE, or UMAP. It allows for the exploration of gene expression variation across cells in reduced dimensional space. Each point on the plot represents an individual cell, with color-coding indicating the expression level of the selected gene.

  • plotMarkerExpression(): This function visualizes the expression distribution of a specific gene across both the overall dataset and individual cell types. It generates density plots to compare these distributions, enabling the assessment of how similar the reference and query datasets are. Additionally, the function applies z-rank normalization to standardize expression values, ensuring comparability between datasets.

Together, these functions provide a comprehensive toolkit for analyzing single-cell RNA-seq data, from initial exploratory data analysis to advanced comparison and projection methods. By integrating these tools into your workflow, you can gain deeper insights into cell type annotations, improve data quality, and enhance the robustness of your single-cell analyses.


R Session Info

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.5.1        scDiagnostics_0.99.8 BiocStyle_2.32.1    

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.34.0 gtable_0.3.5               
 [3] xfun_0.47                   bslib_0.8.0                
 [5] htmlwidgets_1.6.4           Biobase_2.64.0             
 [7] lattice_0.22-6              vctrs_0.6.5                
 [9] tools_4.4.1                 generics_0.1.3             
[11] parallel_4.4.1              stats4_4.4.1               
[13] tibble_3.2.1                fansi_1.0.6                
[15] cluster_2.1.6               BiocNeighbors_1.22.0       
[17] pkgconfig_2.0.3             Matrix_1.7-0               
[19] desc_1.4.3                  S4Vectors_0.42.1           
[21] lifecycle_1.0.4             GenomeInfoDbData_1.2.12    
[23] farver_2.1.2                compiler_4.4.1             
[25] textshaping_0.4.0           munsell_0.5.1              
[27] bluster_1.14.0              codetools_0.2-20           
[29] GenomeInfoDb_1.40.1         htmltools_0.5.8.1          
[31] sass_0.4.9                  yaml_2.3.10                
[33] pkgdown_2.1.1               pillar_1.9.0               
[35] crayon_1.5.3                jquerylib_0.1.4            
[37] BiocParallel_1.38.0         SingleCellExperiment_1.26.0
[39] DelayedArray_0.30.1         cachem_1.1.0               
[41] abind_1.4-8                 tidyselect_1.2.1           
[43] digest_0.6.37               dplyr_1.1.4                
[45] bookdown_0.40               labeling_0.4.3             
[47] fastmap_1.2.0               grid_4.4.1                 
[49] colorspace_2.1-1            cli_3.6.3                  
[51] SparseArray_1.4.8           magrittr_2.0.3             
[53] patchwork_1.3.0             S4Arrays_1.4.1             
[55] utf8_1.2.4                  withr_3.0.1                
[57] UCSC.utils_1.0.0            scales_1.3.0               
[59] rmarkdown_2.28              XVector_0.44.0             
[61] httr_1.4.7                  matrixStats_1.4.1          
[63] igraph_2.0.3                ranger_0.16.0              
[65] ragg_1.3.3                  evaluate_1.0.0             
[67] knitr_1.48                  GenomicRanges_1.56.1       
[69] IRanges_2.38.1              rlang_1.1.4                
[71] Rcpp_1.0.13                 glue_1.7.0                 
[73] BiocManager_1.30.25         BiocGenerics_0.50.0        
[75] jsonlite_1.8.8              R6_2.5.1                   
[77] MatrixGenerics_1.16.0       systemfonts_1.1.0          
[79] fs_1.6.4                    zlibbioc_1.50.0