vignettes/VisualizationTools.Rmd
VisualizationTools.Rmd
This vignette demonstrates the use plotCellTypeMDS()
,
plotCellTypePCA()
, boxplotPCA()
and
calculateDiscriminantSpace()
, which are designed to help
assess and compare cell types between query and reference datasets using
Multidimensional Scaling (MDS) and Principal Component Analysis (PCA),
and Fisher Discriminant Analysis (FDA) respectively.
This vignette also demonstrates how to visualize gene expression data
from single-cell RNA sequencing (scRNA-seq) experiments using two
functions: plotGeneExpressionDimred()
and
plotMarkerExpression()
. These functions allow researchers
to explore gene expression patterns across various dimensions and
compare expression distributions between different datasets or cell
types.
In the context of the scDiagnostics
package, this
vignette illustrates how to evaluate cell type annotations using two
distinct datasets:
reference_data
: This dataset consists of
meticulously curated cell type annotations assigned by experts,
providing a robust benchmark for assessing the quality of annotations.
It serves as a standard reference to evaluate and detect anomalies or
inconsistencies within the cell type annotations.
query_data
: This dataset includes annotations both
from expert assessments and those generated by the SingleR
package. By comparing these annotations, you can identify discrepancies
between manual and automated results, thereby pinpointing potential
inconsistencies or areas that may need further scrutiny.
# Load library
library(scDiagnostics)
# Load datasets
data("reference_data")
data("query_data")
# Set seed for reproducibility
set.seed(0)
The plotCellTypeMDS()
function creates a scatter plot
using Multidimensional Scaling (MDS) to visualize the similarity between
cell types in query and reference datasets. This function generates a
MDS plot that colors cells according to their types based on a
dissimilarity matrix calculated from a specified gene set.
# Generate the MDS scatter plot with cell type coloring
plotCellTypeMDS(query_data = query_data,
reference_data = reference_data,
cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation")
The plotCellTypePCA()
function projects the query
dataset onto the PCA space of the reference dataset. It then plots
specified principal components for different cell types, allowing
comparison of PCA results between query and reference datasets.
# Generate the MDS scatter plot with cell type coloring
plotCellTypePCA(query_data = query_data,
reference_data = reference_data,
cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
pc_subset = 1:3)
Here, plotCellTypePCA()
projects the query data into the
PCA space of the reference data and plots the specified principal
components. The pc_subset
argument allows you to select
which principal components to display.
The boxplotPCA()
function provides a boxplot
visualization of principal components for different cell types across
query and reference datasets. This function helps in comparing the
distributions of PCA scores for specified cell types and datasets.
# Plot the PCA data as boxplots
boxplotPCA(query_data = query_data,
reference_data = reference_data,
cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
pc_subset = 1:3)
In this example, boxplotPCA()
generates boxplots for the
specified principal components, showing the distribution of PCA scores
for different cell types and datasets.
The calculateDiscriminantSpace()
function projects query
single-cell RNA-seq data onto a discriminant space defined by reference
data. This process involves using the reference data to identify key
variables, compute discriminant vectors, and project both the reference
and query data onto this space. The function then assesses the
similarity between the projections of the query data and the reference
data using cosine similarity and Mahalanobis distance.
The function performs the following steps for each pairwise combination of cell types:
First, applying the function to the SingleR
annotation.
# Compute important variables for all pairwise cell comparisons
disc_output <- calculateDiscriminantSpace(
reference_data = reference_data,
query_data = query_data,
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation")
# Generate scatter and boxplot
plot(disc_output, plot_type = "scatterplot")
We aim to evaluate whether a new batch of query cells fits well
within established reference cell types or if there are cells exhibiting
unusual characteristics. We will use the Mahalanobis distance calculated
by the calculateDiscriminantSpace()
function to achieve
this.
# Perform discriminant analysis and projection
discriminant_results <- calculateDiscriminantSpace(
reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
calculate_metrics = TRUE, # Compute Mahalanobis distance and cosine similarity
alpha = 0.01 # Significance level for Mahalanobis distance cutoff
)
You can extract and analyze the Mahalanobis distances of query cells relative to reference cell types. High Mahalanobis distances may indicate potential anomalies or novel cell states.
# Extract Mahalanobis distances
mahalanobis_distances <- discriminant_results$`CD4-CD8`$query_mahalanobis_dist
# Identify anomalies based on a threshold
threshold <- discriminant_results$`CD4-CD8`$mahalanobis_crit # Use the computed cutoff value
anomalies <- mahalanobis_distances[mahalanobis_distances > threshold]
You can now create visualizations to illustrate the distribution of Mahalanobis distances and highlight potential anomalies.
# Load ggplot2 for visualization
library(ggplot2)
# Convert distances to a data frame
distances_df <- data.frame(Distance = mahalanobis_distances,
Anomaly = ifelse(mahalanobis_distances > threshold, "Anomaly", "Normal"))
# Plot histogram of Mahalanobis distances
ggplot(distances_df, aes(x = Distance, fill = Anomaly)) +
geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.7) +
labs(title = "Histogram of Mahalanobis Distances",
x = "Mahalanobis Distance",
y = "Frequency") +
theme_bw()
Low Mahalanobis Distances: These query cells align well with the reference cell types, suggesting they are similar to the established cell types.
High Mahalanobis Distances: Cells with high distances may represent outliers or novel, previously uncharacterized cell types. Further investigation may be needed to understand these anomalies.
The calculateSIRSpace()
function leverages Sliced
Inverse Regression (SIR) to analyze high-dimensional cell type data by
focusing on the directions that best separate different cell types and
their subtypes. The function projects data onto a lower-dimensional SIR
space, aiming to highlight the most informative features for
distinguishing between cell types.
The process starts by handling each cell type as a distinct slice of
the data, allowing for a more detailed analysis. When
multiple_cond_means
is set to TRUE
, the
function refines this by computing conditional means for each cell type,
using clustering to capture finer distinctions within each type. It then
performs Singular Value Decomposition (SVD) on these refined means to
uncover the principal directions that most effectively differentiate the
cell types.
Ultimately, the function provides projections that reveal how cell types and their subtypes relate to each other in a reduced space, facilitating a deeper understanding of cellular variability and distinctions in complex datasets.
# Calculate the SIR space projections
sir_output <- calculateSIRSpace(
reference_data = reference_data,
query_data = query_data,
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
multiple_cond_means = TRUE
)
# Plot the SIR space projections
plot(sir_output,
plot_type = "boxplot",
sir_subset = 1:3)
In this example, we compute the SIR space projections using the
calculateSIRSpace()
function, and then we visualize the
projections with a boxplot to analyze the distribution of the data in
the SIR space. This example helps you understand how different cell
types and their subtypes are projected into a common space, facilitating
their comparison and interpretation.
The plot below visualizes the contribution of different markers to the variance along the SIR dimensions, helping identify which markers are most influential in differentiating e.g. between B cells (including plasma cells) from other cell types.
# Plot the top contributing markers
plot(sir_output,
plot_type = "varplot",
sir_subset = 1:3,
n_top_vars = 10)
The plotGeneExpressionDimred()
function plots the
expression of a specific gene on a dimensional reduction plot. This
function is useful for exploring how gene expression varies across cells
in the context of reduced dimensions, such as those derived from PCA,
t-SNE, or UMAP. Each point on the plot represents a single cell,
color-coded according to the expression level of the specified gene.
Let’s visualize the expression of the gene VPREB3 on a PCA plot:
# Visualize VPREB3 gene on a PCA plot
plotGeneExpressionDimred(se_object = query_data,
method = "PCA",
pc_subset = 1:3,
feature = "VPREB3")
In this example, we visualize the expression of the
VPREB3 gene on a PCA plot derived from the
query_data
dataset. The method
argument
specifies the dimensional reduction technique, which in this case is
“PCA.” The pc_subset
argument allows us to focus on
specific principal components (PCs), here selecting the first five. The
feature
argument is the name of the gene we want to
visualize.
When executed, this function generates a scatter plot with cells colored by their expression levels of VPREB3. The resulting plot can help identify clusters of cells with similar expression patterns or highlight how expression varies across the dataset.
You can also visualize gene expression using other dimensional reduction methods, such as t-SNE or UMAP.
The plotMarkerExpression()
function compares the
distribution of a specific gene’s expression across an overall dataset
and within specific cell types. It generates density plots to visualize
these distributions, providing a graphical means of assessing the
similarity between reference and query datasets. The function also
applies a z-rank normalization to make expression values comparable
between datasets.
plotMarkerExpression(
reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
gene_name = "VPREB3",
cell_type = "B_and_plasma"
)
In this example, the function compares the expression distribution of
the VPREB3 gene between the reference and query
datasets. The ref_cell_type_col
and
query_cell_type_col
arguments specify the columns in
colData
that identify the cell types in the reference and
query datasets, respectively. The cell_type
argument
specifies the cell type of interest, and gene_name
is the
gene whose expression distribution we wish to compare.
When executed, this function generates two density plots: one showing the overall gene expression distribution across all cells and another focusing on the specified cell type. These plots help identify whether the gene’s expression levels are similar between the reference and query datasets, providing insight into dataset alignment.
In this vignette, we have explored various tools for analyzing and visualizing single-cell RNA-seq data to enhance our understanding of cell type annotations and data distributions. Each function demonstrated plays a critical role in this process:
plotCellTypeMDS()
: Utilizing multidimensional
scaling (MDS), this function provides a way to visualize the similarity
between cell types in a reduced dimensional space. It is particularly
useful for comparing the overall structure of the data and assessing how
well cell types are separated based on their MDS coordinates.
visualizeCellTypePCA()
: This function offers a
comprehensive view of cell type-specific PCA results, allowing for the
exploration of how different cell types are distributed in the PCA
space. It helps in understanding the separation and clustering of cell
types based on principal components, facilitating the identification of
patterns and relationships between cell types.
boxplotPCA()
: This function enables the
visualization of principal component analysis (PCA) results through
boxplots, providing insights into the distribution of PCA scores across
different cell types. By examining these plots, researchers can assess
variability and identify potential outliers or inconsistencies within
the data.
calculateDiscriminantSpace()
: This function projects
query data onto a discriminant space defined by reference data, using
discriminant analysis to identify important variables and compute
projections. It also calculates similarity metrics, such as cosine
similarity and Mahalanobis distance, to evaluate how closely query data
aligns with reference data. This function is instrumental in assessing
the accuracy of cell type annotations and detecting potential
anomalies.
plotGeneExpressionDimred()
: This function visualizes
the expression of a specific gene on a dimensional reduction plot, such
as those generated by PCA, t-SNE, or UMAP. It allows for the exploration
of gene expression variation across cells in reduced dimensional space.
Each point on the plot represents an individual cell, with color-coding
indicating the expression level of the selected gene.
plotMarkerExpression()
: This function visualizes the
expression distribution of a specific gene across both the overall
dataset and individual cell types. It generates density plots to compare
these distributions, enabling the assessment of how similar the
reference and query datasets are. Additionally, the function applies
z-rank normalization to standardize expression values, ensuring
comparability between datasets.
Together, these functions provide a comprehensive toolkit for analyzing single-cell RNA-seq data, from initial exploratory data analysis to advanced comparison and projection methods. By integrating these tools into your workflow, you can gain deeper insights into cell type annotations, improve data quality, and enhance the robustness of your single-cell analyses.
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.1 scDiagnostics_0.99.8 BiocStyle_2.32.1
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.34.0 gtable_0.3.5
[3] xfun_0.47 bslib_0.8.0
[5] htmlwidgets_1.6.4 Biobase_2.64.0
[7] lattice_0.22-6 vctrs_0.6.5
[9] tools_4.4.1 generics_0.1.3
[11] parallel_4.4.1 stats4_4.4.1
[13] tibble_3.2.1 fansi_1.0.6
[15] cluster_2.1.6 BiocNeighbors_1.22.0
[17] pkgconfig_2.0.3 Matrix_1.7-0
[19] desc_1.4.3 S4Vectors_0.42.1
[21] lifecycle_1.0.4 GenomeInfoDbData_1.2.12
[23] farver_2.1.2 compiler_4.4.1
[25] textshaping_0.4.0 munsell_0.5.1
[27] bluster_1.14.0 codetools_0.2-20
[29] GenomeInfoDb_1.40.1 htmltools_0.5.8.1
[31] sass_0.4.9 yaml_2.3.10
[33] pkgdown_2.1.1 pillar_1.9.0
[35] crayon_1.5.3 jquerylib_0.1.4
[37] BiocParallel_1.38.0 SingleCellExperiment_1.26.0
[39] DelayedArray_0.30.1 cachem_1.1.0
[41] abind_1.4-8 tidyselect_1.2.1
[43] digest_0.6.37 dplyr_1.1.4
[45] bookdown_0.40 labeling_0.4.3
[47] fastmap_1.2.0 grid_4.4.1
[49] colorspace_2.1-1 cli_3.6.3
[51] SparseArray_1.4.8 magrittr_2.0.3
[53] patchwork_1.3.0 S4Arrays_1.4.1
[55] utf8_1.2.4 withr_3.0.1
[57] UCSC.utils_1.0.0 scales_1.3.0
[59] rmarkdown_2.28 XVector_0.44.0
[61] httr_1.4.7 matrixStats_1.4.1
[63] igraph_2.0.3 ranger_0.16.0
[65] ragg_1.3.3 evaluate_1.0.0
[67] knitr_1.48 GenomicRanges_1.56.1
[69] IRanges_2.38.1 rlang_1.1.4
[71] Rcpp_1.0.13 glue_1.7.0
[73] BiocManager_1.30.25 BiocGenerics_0.50.0
[75] jsonlite_1.8.8 R6_2.5.1
[77] MatrixGenerics_1.16.0 systemfonts_1.1.0
[79] fs_1.6.4 zlibbioc_1.50.0