This function ensures that SingleCellExperiment objects have PCA computed using highly variable genes when needed. It only performs downsampling when PCA computation is required, preserving existing PCA computations without modification.

processPCA(
  query_data = NULL,
  reference_data = NULL,
  max_cells = 5000,
  n_hvgs = 2000,
  assay_name = "logcounts"
)

Arguments

query_data

A SingleCellExperiment object for query data. If NULL, no processing is performed on query data. Default is NULL.

reference_data

A SingleCellExperiment object for reference data. If NULL, no processing is performed on reference data. Default is NULL.

max_cells

Maximum number of cells to retain per dataset when PCA computation is required. Objects with more cells will be randomly downsampled to this number before PCA computation. Objects with existing PCA are never downsampled. Default is 5000.

n_hvgs

Number of highly variable genes to select for PCA computation. When both datasets lack PCA, this number is used for each dataset before taking the intersection. Default is 2000.

assay_name

Name of the assay to use for HVG selection and PCA computation. Should contain log-normalized expression values. Default is "logcounts".

Value

When only one dataset is provided, returns the processed SingleCellExperiment object directly. When both datasets are provided, returns a list containing:

  • query_data: Processed query data

  • reference_data: Processed reference data

Each returned object will have:

  • PCA in reducedDims slot

  • Original cell count if PCA existed, or at most max_cells if PCA was computed

Details

The function performs the following operations:

  • Checks if PCA exists in the provided SingleCellExperiment objects

  • If PCA already exists, returns the object unchanged (no downsampling)

  • If PCA is missing and dataset is large, downsamples before computing PCA

  • Computes PCA using highly variable genes when PCA is missing

  • When both datasets are provided and one has existing PCA, uses the genes from the existing PCA for the other dataset

  • When both datasets lack PCA, uses the intersection of highly variable genes for consistent PCA spaces

  • Utilizes scran for HVG selection and scater for PCA computation (soft dependencies)

The downsampling strategy uses random sampling without replacement and only occurs when PCA computation is necessary. This preserves expensive pre-computed PCA results while ensuring computational efficiency for new PCA computations.

For datasets where one has existing PCA and the other doesn't, the function uses the genes from the existing PCA rotation matrix to ensure compatible PCA spaces. When both datasets lack PCA, it uses the intersection of highly variable genes from both datasets.

Note

This function requires the scran and scater packages for HVG selection and PCA computation. These packages should be installed via BiocManager::install(c("scran", "scater")).

Objects with existing PCA are returned unchanged to preserve expensive pre-computations. Only datasets requiring PCA computation are subject to downsampling.

Examples

# Example 1: Process single dataset
library(TENxPBMCData)
#> Loading required package: HDF5Array
#> Loading required package: SparseArray
#> Loading required package: Matrix
#> 
#> Attaching package: ‘Matrix’
#> The following object is masked from ‘package:S4Vectors’:
#> 
#>     expand
#> Loading required package: S4Arrays
#> Loading required package: abind
#> 
#> Attaching package: ‘S4Arrays’
#> The following object is masked from ‘package:abind’:
#> 
#>     abind
#> The following object is masked from ‘package:base’:
#> 
#>     rowsum
#> Loading required package: DelayedArray
#> 
#> Attaching package: ‘DelayedArray’
#> The following objects are masked from ‘package:base’:
#> 
#>     apply, scale, sweep
#> Loading required package: h5mread
#> Loading required package: rhdf5
#> 
#> Attaching package: ‘h5mread’
#> The following object is masked from ‘package:rhdf5’:
#> 
#>     h5ls
library(scuttle)

# Load and prepare dataset
pbmc_data <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
pbmc_subset <- pbmc_data[, 1:500]  # Use subset for example
pbmc_subset <- logNormCounts(pbmc_subset)

# Remove any existing PCA
reducedDims(pbmc_subset) <- list()

# Process dataset - will compute PCA using HVGs
processed_data <- processPCA(query_data = pbmc_subset, n_hvgs = 1000)
#> Computing PCA for query data (500 cells)

# Check results
"PCA" %in% reducedDimNames(processed_data)  # Should be TRUE
#> [1] TRUE
ncol(processed_data)  # Should be 500 (unchanged)
#> [1] 500

# Example 2: Process both query and reference datasets
# Create reference dataset
pbmc_ref <- pbmc_data[, 501:1000]
pbmc_ref <- logNormCounts(pbmc_ref)
reducedDims(pbmc_ref) <- list()

# Process both datasets - will use intersection of HVGs
result <- processPCA(
  query_data = pbmc_subset,
  reference_data = pbmc_ref,
  max_cells = 10000,
  n_hvgs = 800
)
#> Using 328 common HVGs for both datasets
#> Computing PCA for query data (500 cells)
#> Computing PCA for reference data (500 cells)

# Check results
"PCA" %in% reducedDimNames(result$query_data)      # Should be TRUE
#> [1] TRUE
"PCA" %in% reducedDimNames(result$reference_data)  # Should be TRUE
#> [1] TRUE
ncol(result$query_data)      # Should be 500
#> [1] 500
ncol(result$reference_data)  # Should be 500
#> [1] 500