calculateTopLoadingGeneShifts identifies the genes with the highest loadings for the specified principal components and performs statistical tests to detect distributional differences between query and reference data. It also calculates the proportion of variance explained by each principal component within specific cell types.

The plot method creates visualizations showing expression distributions for top loading genes that exhibit distributional differences between query and reference datasets. Results can be displayed as complex heatmaps or as summary boxplots.

calculateTopLoadingGeneShifts(
  query_data,
  reference_data,
  query_cell_type_col,
  ref_cell_type_col,
  cell_types = NULL,
  pc_subset = 1:5,
  n_top_loadings = 50,
  p_value_threshold = 0.05,
  adjust_method = "fdr",
  assay_name = "logcounts",
  max_cells = 2500
)

# S3 method for class 'calculateTopLoadingGeneShiftsObject'
plot(
  x,
  cell_type,
  pc_subset = 1:3,
  plot_type = c("heatmap", "boxplot"),
  plot_by = c("p_adjusted", "top_loading"),
  n_genes = 10,
  significance_threshold = 0.05,
  ...
)

Arguments

query_data

A SingleCellExperiment object containing a numeric expression matrix for the query cells.

reference_data

A SingleCellExperiment object containing a numeric expression matrix for the reference cells.

query_cell_type_col

The column name in the colData of query_data that identifies the cell types.

ref_cell_type_col

The column name in the colData of reference_data that identifies the cell types.

cell_types

A character vector specifying the cell types to analyze. If NULL, all common cell types are used.

pc_subset

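A numeric vector specifying which principal components to analyze (calculateTopLoadingGeneShifts) or to plot (plot method). Default is 1:5 for calculateTopLoadingGeneShifts and 1:3 for the plot method.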

n_top_loadings

Number of top loading genes to analyze per PC. Default is 50.

p_value_threshold

P-value threshold for statistical significance. Default is 0.05.

adjust_method

Method for multiple testing correction. Default is "fdr".

assay_name

Name of the assay on which to perform computations. Default is "logcounts".

max_cells

Maximum number of cells to retain. If the object has fewer cells, it is returned unchanged. Default is 2500.

x

An object of class calculateTopLoadingGeneShiftsObject.

cell_type

A character string specifying the cell type to plot (must be exactly one).

plot_type

A character string specifying visualization type. Either "heatmap" or "boxplot". Default is "heatmap".

plot_by

A character string specifying gene selection method when `n_genes` is not NULL. Either "top_loading" or "p_adjusted". Default is "p_adjusted".

n_genes

Number of top genes to show per PC. Can be NULL if `significance_threshold` is set. Default is 10.

significance_threshold

If not NULL, a numeric value between 0 and 1. Used for gene selection or annotation. Default is 0.05.

...

Additional arguments passed to draw when plot_type is "heatmap". Ignored when plot_type is "boxplot".

Value

For calculateTopLoadingGeneShifts, a list containing:

  • PC results: Named elements for each PC (e.g., "PC1", "PC2") containing data frames with gene-level analysis results.

  • expression_data: Matrix of expression values for all analyzed genes (genes × cells).

  • cell_metadata: Data frame with columns: cell_id, dataset, cell_type, original_index.

  • gene_metadata: Data frame with columns: gene, pc, loading for all analyzed genes.

  • percent_var: Named numeric vector of global percent variance explained for each analyzed PC.

  • cell_type_variance: A data frame detailing the percent of variance a global PC explains within specific cell types for both query and reference datasets.

The `cell_type_variance` data frame contains columns: pc, cell_type, dataset, percent_variance.
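The idea behind `cell_type_variance` can be illustrated with a small base R sketch. This is not the package's implementation: the simulated data, the helper `percent_var_within`, and the definition used here (variance of a cell type's scores on a global PC, divided by that cell type's total expression variance) are assumptions made for illustration, and the sketch covers a single dataset rather than both query and reference.

```r
# Illustrative sketch only; the definition of "percent variance within a
# cell type" used here is an assumption, not the package's documented formula.
set.seed(42)
n_genes <- 50
expr <- matrix(rnorm(300 * n_genes), nrow = 300,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
cell_type <- rep(c("A", "B", "C"), each = 100)

# Cells of type B vary along a shared axis involving the first five genes
# (the length-100 vector is recycled down each of the five columns)
is_b <- cell_type == "B"
expr[is_b, 1:5] <- expr[is_b, 1:5] + 2 * rnorm(100)

# Global PCA over all cells (cells in rows, genes in columns)
pca <- prcomp(expr, center = TRUE, scale. = FALSE)
global_percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Hypothetical helper: percent of a cell type's total variance that lies
# along one global PC
percent_var_within <- function(cells, pc) {
  sub <- expr[cells, , drop = FALSE]
  scores <- scale(sub, center = pca$center, scale = FALSE) %*% pca$rotation[, pc]
  100 * var(as.numeric(scores)) / sum(apply(sub, 2, var))
}

cell_type_variance <- expand.grid(pc = paste0("PC", 1:3),
                                  cell_type = unique(cell_type),
                                  stringsAsFactors = FALSE)
cell_type_variance$percent_variance <- unname(mapply(
  function(pc, ct) percent_var_within(cell_type == ct, pc),
  cell_type_variance$pc, cell_type_variance$cell_type
))
print(cell_type_variance)
```

Because the PC axes are orthonormal, the percentages for one cell type sum to 100 across all PCs under this definition, so a PC that explains little variance globally can still account for a large share within a single population.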

For the plot method, a plot object.

Details

calculateTopLoadingGeneShifts extracts the top loading genes for each specified principal component from the reference PCA space and performs distributional comparisons between query and reference data. For each gene, it performs statistical tests to identify genes that may be causing PC-specific alignment issues between datasets. A key feature is the calculation of cell-type-specific variance explained by global PCs, which shows how strongly each major axis of variation affects individual cell populations.
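The workflow described above can be sketched in base R. This is a rough illustration under stated assumptions, not the package's code: the data are simulated for a single cell type, the Wilcoxon rank-sum test is assumed here because this page does not name the test used, and the result column names are hypothetical. Only `prcomp`, `wilcox.test`, and `p.adjust` from the stats package are used.

```r
# Illustrative sketch only; the test choice (Wilcoxon rank-sum) and the
# column names below are assumptions, not the package's implementation.
set.seed(1)
n_genes <- 200
n_ref <- 120
n_query <- 100
genes <- paste0("gene", seq_len(n_genes))

# Simulated log-expression for one cell type (cells in rows, genes in columns)
ref <- matrix(rnorm(n_ref * n_genes), nrow = n_ref, dimnames = list(NULL, genes))
query <- matrix(rnorm(n_query * n_genes), nrow = n_query, dimnames = list(NULL, genes))

# A shared latent factor makes the first ten genes dominate the leading PC
ref[, 1:10] <- ref[, 1:10] + 3 * rnorm(n_ref)
query[, 1:10] <- query[, 1:10] + 3 * rnorm(n_query)
# Shift five of those genes in the query to mimic a dataset-specific effect
query[, 1:5] <- query[, 1:5] + 4

# PCA on the reference data defines the PC space
pca <- prcomp(ref, center = TRUE, scale. = FALSE)

# Rank genes by absolute loading on one PC and keep the top n
pc <- "PC1"
n_top <- 10
loadings <- pca$rotation[, pc]
top_genes <- names(sort(abs(loadings), decreasing = TRUE))[seq_len(n_top)]

# Compare query and reference expression for each top loading gene
p_values <- vapply(top_genes, function(g) {
  wilcox.test(query[, g], ref[, g])$p.value
}, numeric(1))

res <- data.frame(
  gene = top_genes,
  loading = unname(loadings[top_genes]),
  p_value = unname(p_values),
  p_adjusted = unname(p.adjust(p_values, method = "fdr")),
  stringsAsFactors = FALSE
)
res$significant <- res$p_adjusted < 0.05
print(res)
```

In `p.adjust`, "fdr" is an alias for the Benjamini-Hochberg method ("BH"), so adjusted p-values are never smaller than the raw ones.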

The plot method visualizes the results from calculateTopLoadingGeneShifts. The "heatmap" option displays a hierarchically clustered set of genes. The "boxplot" option creates a two-panel plot using `ggplot2`: the left panel shows horizontal expression boxplots for up to 5 PCs, while the right panel displays the corresponding PC loadings and adjusted p-values.
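How `plot_by`, `n_genes`, and `significance_threshold` might interact when choosing genes to display can be sketched as follows. The helper `select_genes`, the toy results table, and its column names are hypothetical. The exact selection rules of the plot method are not spelled out on this page, so the behavior when `n_genes` is NULL (keep every gene below the threshold) is an assumption.

```r
# Illustrative sketch only; `select_genes` and the column names are
# hypothetical and do not reproduce the plot method's internal logic.
pc_results <- data.frame(
  gene = paste0("gene", 1:8),
  loading = c(0.40, -0.35, 0.30, -0.25, 0.20, 0.15, -0.10, 0.05),
  p_adjusted = c(0.20, 0.001, 0.04, 0.60, 0.0005, 0.03, 0.90, 0.01),
  stringsAsFactors = FALSE
)

select_genes <- function(results,
                         plot_by = c("p_adjusted", "top_loading"),
                         n_genes = 10,
                         significance_threshold = 0.05) {
  plot_by <- match.arg(plot_by)
  if (is.null(n_genes)) {
    # Assumed behavior: without a gene count, keep all significant genes
    if (is.null(significance_threshold)) {
      stop("Set either n_genes or significance_threshold.")
    }
    return(results$gene[results$p_adjusted < significance_threshold])
  }
  ord <- if (plot_by == "p_adjusted") {
    order(results$p_adjusted)                      # smallest adjusted p first
  } else {
    order(abs(results$loading), decreasing = TRUE) # largest |loading| first
  }
  head(results$gene[ord], n_genes)
}

by_p <- select_genes(pc_results, plot_by = "p_adjusted", n_genes = 3)
by_loading <- select_genes(pc_results, plot_by = "top_loading", n_genes = 3)
by_threshold <- select_genes(pc_results, n_genes = NULL)
```

With this toy table, ranking by adjusted p-value and ranking by loading pick different genes, which is why the choice of `plot_by` matters when only a few genes per PC are shown.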

See also

plot.calculateTopLoadingGeneShiftsObject

calculateTopLoadingGeneShifts