vignettes/AnnotationAnomalies.Rmd
AnnotationAnomalies.Rmd
The scDiagnostics
package provides powerful tools for
anomaly detection in single-cell data, enabling researchers to identify
and analyze outliers in complex datasets. Central to this process is the
detectAnomaly()
function, which integrates dimensionality
reduction through Principal Component Analysis (PCA) with the robust
capabilities of the isolation forest algorithm.
In single-cell analysis, detecting anomalies is crucial for
identifying potential data issues, such as mislabeled cells, technical
artifacts, or biologically distinct subpopulations. The
detectAnomaly()
function offers a versatile approach to
anomaly detection by allowing users to project data onto a PCA space and
apply isolation forests to uncover outliers. Whether working solely with
a reference dataset or comparing a query dataset against a
well-characterized reference, this function provides detailed insights
into potential anomalies.
This vignette illustrates how to effectively use the
detectAnomaly()
function in various scenarios. We explore
both cell-type-specific and global anomaly detection, demonstrate the
utility of integrating query data with reference data, and offer
guidance on interpreting the results. Additionally, we show how to
extend this analysis by combining anomaly detection with PCA loadings
using the calculateCellSimilarityPCA()
function, providing
a comprehensive toolkit for investigating the structure and quality of
single-cell data.
Whether you’re looking to enhance the accuracy of your cell type
annotations or identify subtle deviations in your data, the tools
provided by scDiagnostics
will empower you to conduct
thorough and nuanced assessments of your single-cell datasets.
In the context of the scDiagnostics
package, the
following datasets illustrate the application of these tools:
reference_data
: A curated and processed dataset
containing expert-assigned cell type annotations. This dataset serves as
a reference for comparison and can be used alone to detect anomalies
within the reference annotations.
query_data
: A dataset that also includes
expert-assigned cell type annotations, but additionally features
annotations generated by the SingleR
package. This allows for the comparison of expert annotations with those
produced by an automated method to detect inconsistencies or
anomalies.
# Load library
library(scDiagnostics)
# Load datasets
data("reference_data")
data("query_data")
# Set seed for reproducibility
set.seed(0)
By using these datasets, you can leverage the package’s tools to
compare annotations between reference_data
and
query_data
, or analyze reference_data
alone to
identify potential issues. The package’s flexibility supports various
analysis scenarios, whether you need to assess overall annotation
quality or focus on specific cell types.
Through these capabilities, scDiagnostics
empowers you
to perform thorough and nuanced assessments of cell type annotations,
enhancing the accuracy and reliability of your analyses.
detectAnomaly()
Function
The detectAnomaly()
function integrates dimensionality
reduction via PCA with the isolation forest algorithm to detect
anomalies in single-cell data. By projecting both reference and query
datasets (if available) onto the PCA space of the reference data, the
function trains an isolation forest model on reference data in PCA space
to pinpoint anomalies in the reference or query data. This approach is
highly versatile:
The function takes a
r
BiocStyle::Biocpkg(“SingleCellExperiment”)object as
reference_dataand trains an isolation forest model on the reference PCA-projected data, with an optional
query_datafor projecting onto this PCA space for anomaly detection. You can specify cell type annotations through
ref_cell_type_coland
query_cell_type_col, and limit the analysis to certain cell types using the
cell_typesparameter. The function allows you to select specific principal components to use to train the isolation forest via
pc_subset, adjust the number of trees with
n_tree, and set an
anomaly_threshold`
for classifying anomalies.
The function returns several outputs: anomaly_scores
indicating the likelihood of each cell being an anomaly, a logical
vector (anomaly
) identifying these anomalies, PCA
projections for the reference data (reference_mat_subset
)
and optionally for the query data (query_mat_subset
), and
the proportion of variance explained by the selected principal
components (var_explained
).
detectAnomaly()
Examples
This section demonstrates how to use the detectAnomaly()
function when both reference and query datasets are provided. It
includes examples of analyzing anomalies for specific cell types and
globally across all data.
In this example, we analyze anomalies specifically for the “CD4” cell type. The anomaly scores are trained on the PCA projections of the “CD4” cells from the reference dataset. If query data is provided, anomaly scores for the query data are predicted based on the PCA projections of the query data onto the reference PCA space for the “CD4” cell type.
# Perform anomaly detection
anomaly_output <- detectAnomaly(reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
pc_subset = 1:5,
n_tree = 500,
anomaly_treshold = 0.6)
# Plot the output for the "CD4" cell type
plot(anomaly_output,
cell_type = "CD4",
pc_subset = 1:5,
data_type = "query")
In this example, we analyze anomalies specifically for the “CD4” cell type. The anomaly scores are trained on the PCA projections of the “CD4” cells from the reference dataset. If query data is provided, anomaly scores for the query data are predicted based on the PCA projections of the query data onto the reference PCA space for the “CD4” cell type.
Here, we perform global anomaly detection by setting
cell_type = NULL
. In this case, the isolation forest is
trained on PCA projections of all cells in the reference data combined.
The global anomaly scores are then computed for both reference and query
datasets.
# Perform anomaly detection
anomaly_output <- detectAnomaly(reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
pc_subset = 1:5,
n_tree = 500,
anomaly_treshold = 0.6)
# Plot the global anomaly scores
plot(anomaly_output,
cell_type = NULL, # Plot all cell types
pc_subset = 1:5,
data_type = "query")
Setting cell_type = NULL
means that the anomaly
detection is done globally. The isolation forest is trained on PCA
projections of all cells from the reference dataset. Anomaly scores are
then predicted for the query data based on these global PCA projections.
The plot provides a comprehensive view of anomalies across all cell
types in the query dataset.
Note that if you do not provide a query dataset to the
detectAnomaly()
function, the function will only return the
anomaly output data on the reference data.
The detectAnomaly()
function identifies anomalous cells
in a dataset by detecting outliers based on PCA results. Once anomalies
are detected, calculateCellSimilarityPCA()
evaluates how
these outliers influence principal component directions. This analysis
helps determine if anomalous cells significantly impact later PCs, which
capture finer variations in the data. By combining these functions, you
can both pinpoint anomalies and understand their effect on PCA
directions, providing deeper insights into the data structure.
# Detect anomalies and select top anomalies for further analysis
anomaly_output <- detectAnomaly(reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
pc_subset = 1:10,
n_tree = 500,
anomaly_treshold = 0.5)
# Select the top 6 anomalies based on the anomaly scores
top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores,
decreasing = TRUE)[1:6])
# Calculate cosine similarity between the top anomalies and top 25 PCs
cosine_similarities <- calculateCellSimilarityPCA(reference_data,
cell_names = top6_anomalies,
pc_subset = 1:25,
n_top_vars = 50)
# Plot the cosine similarities across PCs
round(cosine_similarities[, paste0("PC", 15:25)], 2)
#> PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
#> ACACCAATCTTAACCT-1 0.10 -0.05 0.24 0.01 -0.05 -0.08 0.16 -0.15 0.01 -0.08
#> GGCTGGTAGCCAACAG-1 0.06 0.13 0.33 -0.06 -0.13 0.03 0.07 -0.01 0.11 -0.04
#> TACACGAAGCGATAGC-1 0.07 0.02 -0.01 0.12 0.16 0.02 -0.08 0.14 -0.06 0.26
#> CATGGCGCATGCTGGC-1 0.13 -0.12 0.08 0.14 -0.24 -0.03 -0.01 -0.17 -0.17 0.05
#> GTACTTTTCCGTTGTC-1 0.16 0.01 0.24 -0.03 -0.02 -0.04 -0.11 -0.08 -0.15 0.16
#> ATTTCTGAGCGATATA-1 0.05 0.29 0.24 0.18 -0.20 0.05 -0.15 0.08 0.15 -0.11
#> PC25
#> ACACCAATCTTAACCT-1 0.01
#> GGCTGGTAGCCAACAG-1 0.26
#> TACACGAAGCGATAGC-1 0.36
#> CATGGCGCATGCTGGC-1 0.01
#> GTACTTTTCCGTTGTC-1 0.17
#> ATTTCTGAGCGATATA-1 0.06
Note that there is also a plot method for the object return for
calculateCellSimilarityPCA()
. See reference manual for
details.
This vignette section provides a comprehensive overview of how to use
the detectAnomaly()
function for both reference and query
datasets, with specific and global anomaly detection examples, along
with how to visualize the results using the plot method. With
calculateCellSimilarityPCA()
, we can assess whether cells
have a large effect on the loadings of PC vectors, affecting the
downstream analysis.
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scDiagnostics_0.99.8 BiocStyle_2.32.1
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.34.0 gtable_0.3.5
[3] xfun_0.47 bslib_0.8.0
[5] ggplot2_3.5.1 htmlwidgets_1.6.4
[7] Biobase_2.64.0 lattice_0.22-6
[9] generics_0.1.3 vctrs_0.6.5
[11] tools_4.4.1 stats4_4.4.1
[13] parallel_4.4.1 tibble_3.2.1
[15] fansi_1.0.6 pkgconfig_2.0.3
[17] Matrix_1.7-0 desc_1.4.3
[19] S4Vectors_0.42.1 lifecycle_1.0.4
[21] GenomeInfoDbData_1.2.12 farver_2.1.2
[23] compiler_4.4.1 textshaping_0.4.0
[25] munsell_0.5.1 GenomeInfoDb_1.40.1
[27] htmltools_0.5.8.1 sass_0.4.9
[29] yaml_2.3.10 pkgdown_2.1.1
[31] pillar_1.9.0 crayon_1.5.3
[33] jquerylib_0.1.4 SingleCellExperiment_1.26.0
[35] DelayedArray_0.30.1 cachem_1.1.0
[37] abind_1.4-8 tidyselect_1.2.1
[39] digest_0.6.37 dplyr_1.1.4
[41] bookdown_0.40 labeling_0.4.3
[43] fastmap_1.2.0 grid_4.4.1
[45] colorspace_2.1-1 cli_3.6.3
[47] SparseArray_1.4.8 magrittr_2.0.3
[49] S4Arrays_1.4.1 utf8_1.2.4
[51] withr_3.0.1 UCSC.utils_1.0.0
[53] scales_1.3.0 rmarkdown_2.28
[55] XVector_0.44.0 httr_1.4.7
[57] matrixStats_1.4.1 ragg_1.3.3
[59] isotree_0.6.1-1 evaluate_1.0.0
[61] knitr_1.48 GenomicRanges_1.56.1
[63] IRanges_2.38.1 rlang_1.1.4
[65] Rcpp_1.0.13 glue_1.7.0
[67] BiocManager_1.30.25 BiocGenerics_0.50.0
[69] jsonlite_1.8.8 R6_2.5.1
[71] MatrixGenerics_1.16.0 systemfonts_1.1.0
[73] fs_1.6.4 zlibbioc_1.50.0