R/calculateTopLoadingGeneShifts.R, R/plot.calculateTopLoadingGeneShifts.R
calculateTopLoadingGeneShifts.Rd
calculateTopLoadingGeneShifts identifies genes with the highest loadings for specified principal components and performs statistical tests to detect distributional differences between query and reference data. It also calculates the proportion of variance explained by each principal component within specific cell types. Optionally, it can detect anomalous cells using isolation forests.
The plot method visualizes expression distributions for top loading genes that exhibit distributional differences between query and reference datasets. Results can be displayed as a heatmap or as summary boxplots. Anomaly status is optionally displayed when available.
calculateTopLoadingGeneShifts(
  query_data,
  reference_data,
  query_cell_type_col,
  ref_cell_type_col,
  cell_types = NULL,
  pc_subset = 1:5,
  n_top_loadings = 50,
  p_value_threshold = 0.05,
  adjust_method = "fdr",
  assay_name = "logcounts",
  detect_anomalies = FALSE,
  anomaly_threshold = 0.6,
  n_tree = 500,
  max_cells = 2500
)

# S3 method for class 'calculateTopLoadingGeneShiftsObject'
plot(
  x,
  cell_type,
  pc_subset = 1:3,
  plot_type = c("heatmap", "boxplot"),
  plot_by = c("p_adjusted", "top_loading"),
  n_genes = 10,
  significance_threshold = 0.05,
  show_anomalies = FALSE,
  draw_plot = TRUE,
  ...
)
query_data: A SingleCellExperiment object containing a numeric expression matrix for the query cells.
reference_data: A SingleCellExperiment object containing a numeric expression matrix for the reference cells.
query_cell_type_col: The column name in the colData of query_data that identifies the cell types.
ref_cell_type_col: The column name in the colData of reference_data that identifies the cell types.
cell_types: A character vector specifying the cell types to analyze. If NULL, all common cell types are used.
pc_subset: A numeric vector specifying which principal components to analyze (calculateTopLoadingGeneShifts, default 1:5) or to plot (plot method, default 1:3).
n_top_loadings: Number of top loading genes to analyze per PC. Default is 50.
p_value_threshold: P-value threshold for statistical significance. Default is 0.05.
adjust_method: Method for multiple testing correction. Default is "fdr".
assay_name: Name of the assay on which to perform computations. Default is "logcounts".
detect_anomalies: Logical indicating whether to perform anomaly detection using isolation forests. Default is FALSE.
anomaly_threshold: A numeric value specifying the threshold for identifying anomalies when detect_anomalies is TRUE. Default is 0.6.
n_tree: An integer specifying the number of trees for the isolation forest when detect_anomalies is TRUE. Default is 500.
max_cells: Maximum number of cells to retain. If the object has fewer cells, it is returned unchanged. Default is 2500.
x: An object of class calculateTopLoadingGeneShiftsObject.
cell_type: A character string specifying the cell type to plot (must be exactly one).
plot_type: A character string specifying visualization type. Either "heatmap" or "boxplot". Default is "heatmap".
plot_by: A character string specifying gene selection method when `n_genes` is not NULL. Either "top_loading" or "p_adjusted". Default is "p_adjusted".
n_genes: Number of top genes to show per PC. Can be NULL if `significance_threshold` is set. Default is 10.
significance_threshold: If not NULL, a numeric value between 0 and 1. Used for gene selection or annotation. Default is 0.05.
show_anomalies: Logical indicating whether to display anomaly status annotations. Default is FALSE. Requires anomaly results to be present in the object.
draw_plot: Logical indicating whether to draw the plot immediately (TRUE) or return the undrawn plot object (FALSE). For heatmaps, FALSE returns a ComplexHeatmap object that can be further customized before drawing. Default is TRUE.
...: Additional arguments passed to draw for heatmaps; not used for boxplots.
A list containing:
PC results: Named elements for each PC (e.g., "PC1", "PC2") containing data frames with gene-level analysis results.
expression_data: Matrix of expression values for all analyzed genes (genes × cells).
cell_metadata: Data frame with columns: cell_id, dataset, cell_type, original_index, and optionally anomaly_status.
gene_metadata: Data frame with columns: gene, pc, loading for all analyzed genes.
percent_var: Named numeric vector of global percent variance explained for each analyzed PC.
cell_type_variance: A data frame detailing the percent of variance a global PC explains within specific cell types for both query and reference datasets.
anomaly_results: If detect_anomalies is TRUE, contains the full output from detectAnomaly.
The `cell_type_variance` data frame contains columns: pc, cell_type, dataset, percent_variance. When anomaly detection is enabled, `cell_metadata` includes an additional `anomaly_status` column.
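To make the cell_type_variance table more concrete, the following is a minimal base R sketch of one way such a quantity could be computed. It is not the package's implementation: the toy data, the helper percent_var_within, and the definition used here (variance of the projected scores divided by the total variance within the cell type) are all assumptions made for illustration.

```r
set.seed(2)

# Toy expression matrix: 200 cells (rows) x 10 genes (columns), two cell types.
genes <- paste0("gene", 1:10)
expr <- matrix(rnorm(200 * 10), nrow = 200, dimnames = list(NULL, genes))
cell_type <- rep(c("A", "B"), each = 100)

# Type A cells share a strong factor on gene1 and gene2; type B cells do not.
expr[cell_type == "A", 1:2] <- expr[cell_type == "A", 1:2] + 4 * rnorm(100)

# Global PCA over all cells; rotation holds the gene loadings.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)
loadings <- pca$rotation[, 1:2]

# Hypothetical helper: percent of a subset's total variance that lies
# along each global PC direction.
percent_var_within <- function(x, loadings) {
  x_centered <- scale(x, center = TRUE, scale = FALSE)  # center within the subset
  scores <- x_centered %*% loadings                     # project onto global PCs
  total_var <- sum(apply(x_centered, 2, var))
  100 * apply(scores, 2, var) / total_var
}

# One row per PC and cell type, using the column layout described above.
cell_type_variance <- do.call(rbind, lapply(unique(cell_type), function(ct) {
  pv <- percent_var_within(expr[cell_type == ct, , drop = FALSE], loadings)
  data.frame(pc = names(pv), cell_type = ct, dataset = "reference",
             percent_variance = unname(pv))
}))
print(cell_type_variance)
```

In this toy setup PC1 accounts for a much larger share of variance in type A than in type B, which is the kind of contrast the table is meant to expose.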
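A plot object. For heatmaps this is a ComplexHeatmap object (returned undrawn when draw_plot is FALSE); boxplots are built with ggplot2.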
This function extracts the top loading genes for each specified principal component from the reference PCA space and performs distributional comparisons between query and reference data. For each gene, it performs statistical tests to identify genes that may be causing PC-specific alignment issues between datasets. A key feature is the calculation of cell-type-specific variance explained by global PCs, providing a more nuanced view of how major biological axes affect individual populations. When anomaly detection is enabled, isolation forests are used to identify anomalous cells based on their PCA projections.
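The workflow described above (pick top loading genes from the reference PCA, test each gene for a query/reference difference, adjust for multiple testing) can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's code: the data are simulated, and the Wilcoxon rank-sum test is a stand-in because the documentation does not name the statistical test used.

```r
set.seed(1)

# Toy data: 60 reference and 60 query cells (rows), 20 genes (columns).
n_cells <- 60
genes <- paste0("gene", 1:20)
ref <- matrix(rnorm(n_cells * 20), nrow = n_cells, dimnames = list(NULL, genes))
qry <- matrix(rnorm(n_cells * 20), nrow = n_cells, dimnames = list(NULL, genes))

# A shared factor on the first five genes makes them dominate PC1.
ref[, 1:5] <- ref[, 1:5] + 3 * rnorm(n_cells)
qry[, 1:5] <- qry[, 1:5] + 3 * rnorm(n_cells)

# Shift one gene in the query only, to simulate a distributional difference.
qry[, "gene1"] <- qry[, "gene1"] + 4

# PCA on the reference; rotation holds the gene loadings for each PC.
pca <- prcomp(ref, center = TRUE, scale. = FALSE)

# Top loading genes for PC1, ranked by absolute loading.
n_top_loadings <- 5
pc1_loadings <- pca$rotation[, "PC1"]
top_genes <- names(sort(abs(pc1_loadings), decreasing = TRUE))[seq_len(n_top_loadings)]

# Per-gene two-sample test (assumed stand-in: Wilcoxon), then FDR adjustment.
p_values <- vapply(top_genes,
                   function(g) wilcox.test(qry[, g], ref[, g])$p.value,
                   numeric(1))
p_adjusted <- p.adjust(p_values, method = "fdr")

result <- data.frame(gene = top_genes,
                     loading = unname(pc1_loadings[top_genes]),
                     p_value = unname(p_values),
                     p_adjusted = unname(p_adjusted),
                     significant = unname(p_adjusted) < 0.05)
print(result)
```

The shifted gene is flagged as significant after adjustment, while the adjusted p-values are never smaller than the raw ones.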
This function visualizes the results from calculateTopLoadingGeneShifts.
The "heatmap" option displays a hierarchically clustered set of genes.
The "boxplot" option creates a two-panel plot using `ggplot2`: the left panel shows horizontal expression boxplots for up to 5 PCs, while the right panel displays their corresponding PC loadings and adjusted p-values. When anomaly detection results are available and show_anomalies is TRUE, an additional annotation bar highlights anomalous cells.
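How plot_by, n_genes and significance_threshold might interact when choosing genes to display can be sketched in base R. The helper select_genes and the column names gene, loading and p_adjusted are hypothetical, and the selection rules are an assumed reading of the argument descriptions rather than the method's actual logic.

```r
# Hypothetical helper illustrating one plausible selection rule.
select_genes <- function(results,
                         plot_by = c("p_adjusted", "top_loading"),
                         n_genes = 10,
                         significance_threshold = 0.05) {
  plot_by <- match.arg(plot_by)

  # With n_genes = NULL, fall back to the significance threshold.
  if (is.null(n_genes)) {
    if (is.null(significance_threshold)) {
      stop("Set n_genes or significance_threshold.")
    }
    keep <- results[results$p_adjusted < significance_threshold, , drop = FALSE]
    return(keep$gene[order(keep$p_adjusted)])
  }

  # Otherwise rank by adjusted p-value or by absolute loading and take the top n.
  ord <- if (plot_by == "p_adjusted") {
    order(results$p_adjusted)
  } else {
    order(abs(results$loading), decreasing = TRUE)
  }
  head(results$gene[ord], n_genes)
}

# Toy per-PC results table (values made up for illustration).
results <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  loading = c(0.50, -0.70, 0.10, 0.30),
  p_adjusted = c(0.20, 0.04, 0.001, 0.60),
  stringsAsFactors = FALSE
)

by_p <- select_genes(results, plot_by = "p_adjusted", n_genes = 2)         # "g3" "g2"
by_loading <- select_genes(results, plot_by = "top_loading", n_genes = 2)  # "g2" "g1"
by_threshold <- select_genes(results, n_genes = NULL)                      # "g3" "g2"
```

Ranking by adjusted p-value favors genes with the clearest distributional shift, whereas ranking by loading favors genes that contribute most to the PC regardless of whether they differ between datasets.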
plot.calculateTopLoadingGeneShiftsObject, detectAnomaly
calculateTopLoadingGeneShifts
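A typical call sequence, using only the arguments shown in the usage above, might look like the sketch below. The wrapper run_shift_analysis, its argument names, and the default column name "cell_type" are hypothetical. The wrapper assumes that the package providing calculateTopLoadingGeneShifts is attached and that the caller supplies two SingleCellExperiment objects with a "logcounts" assay.

```r
# Hypothetical convenience wrapper; nothing runs until it is called.
# query_sce / reference_sce: SingleCellExperiment objects supplied by the caller.
# cell_type_col: assumed colData column holding cell type labels in both objects.
# cell_type_to_plot: the single cell type passed on to the plot method.
run_shift_analysis <- function(query_sce, reference_sce,
                               cell_type_col = "cell_type",
                               cell_type_to_plot) {
  # Compute gene-level shift statistics for the first three PCs.
  shifts <- calculateTopLoadingGeneShifts(
    query_data = query_sce,
    reference_data = reference_sce,
    query_cell_type_col = cell_type_col,
    ref_cell_type_col = cell_type_col,
    pc_subset = 1:3,
    n_top_loadings = 50,
    adjust_method = "fdr"
  )

  # Dispatches to the S3 plot method for the returned object.
  plot(shifts,
       cell_type = cell_type_to_plot,
       pc_subset = 1:3,
       plot_type = "boxplot",
       plot_by = "p_adjusted",
       n_genes = 10)

  invisible(shifts)
}
```

Returning the result invisibly keeps the full object available for further inspection, for example of cell_type_variance, after the plot has been drawn.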