This function ensures that a SingleCellExperiment object has a valid PCA computed from highly variable genes (HVGs). If the object already contains a valid PCA, it is returned unmodified. Downsampling happens only when PCA has to be computed.

processPCA(
  sce_object,
  assay_name = "logcounts",
  n_hvgs = 2000,
  max_cells = NULL
)

Arguments

sce_object

A SingleCellExperiment object to process.

assay_name

Name of the assay to use for HVG selection and PCA computation. Should contain log-normalized expression values. Default is "logcounts".

n_hvgs

Number of highly variable genes to select for PCA computation. Default is 2000.

max_cells

Maximum number of cells to retain if downsampling is needed for PCA computation. If NULL, no downsampling is performed. Default is NULL.

Value

A SingleCellExperiment object with a valid PCA in the reducedDims slot, including the rotation matrix and percentVar attributes. The returned object keeps its original cell count if the existing PCA was valid. If PCA had to be computed, it contains at most max_cells cells.

Details

The function performs the following operations:

  • Checks if PCA exists and is valid in the provided SingleCellExperiment object

  • Validates PCA integrity including rotation matrix, percentVar, gene consistency, and dimensions

  • If PCA is valid, returns the object unchanged (no downsampling)

  • If PCA is missing or invalid and the dataset has more cells than max_cells, downsamples to max_cells cells before computing PCA

  • Computes PCA using highly variable genes when PCA is missing or invalid

  • Utilizes scran for HVG selection and scater for PCA computation (soft dependencies)
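
The HVG selection and PCA steps above can be sketched with scran and scater as follows. This is only an illustration of one plausible workflow, not the internals of processPCA: the mock dataset from scuttle::mockSCE, the variable names, and the parameter values (200 HVGs, 20 components) are assumptions chosen to keep the example small.

```r
# Sketch only: one plausible way to select HVGs and compute PCA.
# Requires the Bioconductor packages scuttle, scran and scater.
library(scuttle)

set.seed(42)                                  # the simulated counts are random
sce <- mockSCE(ncells = 200, ngenes = 1000)   # mock data, for illustration only
sce <- logNormCounts(sce)                     # adds the "logcounts" assay

# Model per-gene variance, then keep the genes with the largest
# biological variance component.
gene_var <- scran::modelGeneVar(sce, assay.type = "logcounts")
hvgs <- scran::getTopHVGs(gene_var, n = 200)

# PCA restricted to the selected genes (runPCA reads "logcounts" by default).
sce <- scater::runPCA(sce, subset_row = hvgs, ncomponents = 20)

pca <- SingleCellExperiment::reducedDim(sce, "PCA")
dim(pca)                        # cells x components
head(attr(pca, "percentVar"))   # percentage of variance per component
```

scater stores the rotation matrix and percentVar as attributes of the PCA matrix, which is where the validation checks described below would look for them.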

The downsampling strategy uses random sampling without replacement and only occurs when PCA computation is necessary. This preserves expensive pre-computed PCA results while ensuring computational efficiency for new PCA computations.
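
A minimal base R sketch of random sampling without replacement is shown below. The helper name downsample_indices is hypothetical, and sorting the sampled indices to preserve the original cell order is an assumption, not something the documentation states.

```r
# Hypothetical helper: choose which cells to keep before computing PCA.
downsample_indices <- function(n_cells, max_cells = NULL) {
  # No limit given, or the dataset is already small enough: keep every cell.
  if (is.null(max_cells) || n_cells <= max_cells) {
    return(seq_len(n_cells))
  }
  # Random sample without replacement. Sorting (an assumption) keeps the
  # retained cells in their original order.
  sort(sample.int(n_cells, size = max_cells, replace = FALSE))
}

set.seed(1)
keep <- downsample_indices(n_cells = 10000, max_cells = 2500)
length(keep)          # number of retained cells
anyDuplicated(keep)   # 0 means no cell was selected twice

# Applied to an object: sce_object <- sce_object[, keep]
```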

PCA validation includes checking for:

  • Presence of PCA in reducedDims

  • Existence of rotation matrix and percentVar attributes

  • Gene consistency between rotation matrix and current assay

  • Dimension consistency between PCA coordinates and cell count
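
The four checks above could be expressed roughly as follows. The helper is_valid_pca is hypothetical and works on a plain matrix so that it runs without Bioconductor. With a real object, pca would typically be obtained with reducedDim(sce_object, "PCA") after confirming that "PCA" is in reducedDimNames(sce_object). Interpreting "gene consistency" as "every gene in the rotation matrix is present in the assay" is an assumption.

```r
# Hypothetical helper mirroring the four validation checks.
#   pca        : PCA coordinate matrix (cells x components), or NULL if absent
#   gene_names : row names of the current assay
#   n_cells    : number of cells (columns) in the object
is_valid_pca <- function(pca, gene_names, n_cells) {
  # 1. PCA must be present.
  if (is.null(pca)) return(FALSE)

  # 2. The rotation matrix and percentVar attributes must exist.
  rotation <- attr(pca, "rotation")
  percent_var <- attr(pca, "percentVar")
  if (is.null(rotation) || is.null(percent_var)) return(FALSE)

  # 3. Genes in the rotation matrix must be named and found in the assay.
  rot_genes <- rownames(rotation)
  if (is.null(rot_genes) || !all(rot_genes %in% gene_names)) return(FALSE)

  # 4. One row of PCA coordinates per cell.
  nrow(pca) == n_cells
}

# Toy input: 3 cells, 2 components, 2 genes in the rotation matrix.
toy_pca <- matrix(rnorm(6), nrow = 3,
                  dimnames = list(paste0("cell", 1:3), c("PC1", "PC2")))
attr(toy_pca, "rotation") <- matrix(rnorm(4), nrow = 2,
                                    dimnames = list(c("geneA", "geneB"),
                                                    c("PC1", "PC2")))
attr(toy_pca, "percentVar") <- c(60, 40)

is_valid_pca(toy_pca, c("geneA", "geneB", "geneC"), n_cells = 3)  # TRUE
is_valid_pca(toy_pca, c("geneA"), n_cells = 3)                    # FALSE: geneB missing
is_valid_pca(toy_pca, c("geneA", "geneB", "geneC"), n_cells = 5)  # FALSE: cell count differs
```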

Note

This function requires the scran and scater packages for HVG selection and PCA computation. Install them with BiocManager::install(c("scran", "scater")).
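
Because scran and scater are soft dependencies, a function like this would normally check for them at run time. The sketch below shows a common base R pattern using requireNamespace. The helper name check_soft_dependencies and the wording of the error message are hypothetical and may differ from what processPCA actually does.

```r
# Hypothetical helper: fail early with an actionable message if a
# soft dependency is not installed.
check_soft_dependencies <- function(pkgs = c("scran", "scater")) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    stop(
      "Missing required package(s): ", paste(missing_pkgs, collapse = ", "),
      ". Install with BiocManager::install(c(",
      paste0('"', missing_pkgs, '"', collapse = ", "), "))",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# "stats" ships with R, so this check passes silently.
check_soft_dependencies("stats")
```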

Objects with existing valid PCA are returned unchanged to preserve expensive pre-computations. Only datasets requiring PCA computation are subject to downsampling.

Examples

# Load and prepare dataset
library(TENxPBMCData)
library(scuttle)

pbmc_data <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
pbmc_subset <- pbmc_data[, 1:500]
pbmc_subset <- logNormCounts(pbmc_subset)

# Remove any existing PCA
reducedDims(pbmc_subset) <- list()

# Process dataset - will compute PCA using HVGs
processed_data <- processPCA(sce_object = pbmc_subset, n_hvgs = 1000)
#> Data missing PCA - computing...
#> Computing PCA...
#> Using 1000 highly variable genes for PCA computation

# Check results
"PCA" %in% reducedDimNames(processed_data)  # Should be TRUE
#> [1] TRUE
ncol(processed_data)  # Should be 500 (unchanged)
#> [1] 500