This function ensures that a SingleCellExperiment object has a valid PCA, computed from highly variable genes when needed. It downsamples only when PCA has to be computed; an existing valid PCA is kept without modification.
processPCA(
  sce_object,
  assay_name = "logcounts",
  n_hvgs = 2000,
  max_cells = NULL
)
sce_object: A SingleCellExperiment object to process.
assay_name: Name of the assay to use for HVG selection and PCA computation. Should contain log-normalized expression values. Default is "logcounts".
n_hvgs: Number of highly variable genes to select for PCA computation. Default is 2000.
max_cells: Maximum number of cells to retain if downsampling is needed for PCA computation. If NULL, no downsampling is performed. Default is NULL.
A SingleCellExperiment object with a valid PCA in the reducedDims slot, including the rotation matrix and percentVar attributes. The object keeps its original cell count if the existing PCA was valid; if PCA had to be computed and max_cells is set, it contains at most max_cells cells.
The function performs the following operations:
Checks if PCA exists and is valid in the provided SingleCellExperiment object
Validates PCA integrity including rotation matrix, percentVar, gene consistency, and dimensions
If PCA is valid, returns the object unchanged (no downsampling)
If PCA is missing or invalid and the dataset has more cells than max_cells, downsamples before computing PCA
Computes PCA using highly variable genes when PCA is missing or invalid
Utilizes scran for HVG selection and scater for PCA computation (soft dependencies)
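The HVG-then-PCA step listed above can be illustrated with a minimal base R sketch. This is a stand-in, not the function's implementation: it ranks genes by plain variance and uses prcomp() instead of scran and scater, and the toy matrix and the names logexpr, hvgs, and pca are assumptions made for the example. It only shows the shape of the result: cell coordinates carrying rotation and percentVar attributes.

```r
# Illustrative stand-in only: base R replaces scran/scater here.
set.seed(1)

# Toy log-expression matrix: 200 genes (rows) x 50 cells (columns)
logexpr <- matrix(
  rnorm(200 * 50), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50))
)

n_hvgs <- 20
n_pcs <- 10

# Rank genes by variance across cells and keep the top n_hvgs
gene_var <- apply(logexpr, 1, var)
hvgs <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_hvgs)]

# prcomp() expects observations (cells) in rows, so transpose
fit <- prcomp(t(logexpr[hvgs, ]), center = TRUE, scale. = FALSE, rank. = n_pcs)

# Cell coordinates plus the two attributes the validation step looks for
pca <- fit$x
attr(pca, "rotation") <- fit$rotation
attr(pca, "percentVar") <- 100 * fit$sdev[seq_len(n_pcs)]^2 / sum(fit$sdev^2)
```

The rotation matrix has one row per selected gene, which is what makes a later gene-consistency check against the assay possible.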
The downsampling strategy uses random sampling without replacement and only occurs when PCA computation is necessary. This preserves expensive pre-computed PCA results while ensuring computational efficiency for new PCA computations.
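As a rough illustration of random sampling without replacement, the sketch below caps the number of columns (cells) of a plain matrix. The helper downsample_cells() is hypothetical and is not the package's internal code; the same column subsetting with `[` also applies to SingleCellExperiment objects.

```r
# Hypothetical helper; not the package's internal code.
downsample_cells <- function(x, max_cells = NULL) {
  # No cap given, or already within the cap: return the input untouched
  if (is.null(max_cells) || ncol(x) <= max_cells) {
    return(x)
  }
  # sample() draws without replacement by default; sort to keep cell order
  keep <- sort(sample(ncol(x), max_cells))
  x[, keep, drop = FALSE]
}

set.seed(42)
counts <- matrix(
  rpois(10 * 100, lambda = 5), nrow = 10,
  dimnames = list(NULL, paste0("cell", 1:100))
)

small <- downsample_cells(counts, max_cells = 30)
dim(small)
```

With max_cells = NULL, or a cap larger than the number of cells, the input comes back unchanged, mirroring the documented default.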
PCA validation includes checking for:
Presence of PCA in reducedDims
Existence of rotation matrix and percentVar attributes
Gene consistency between rotation matrix and current assay
Dimension consistency between PCA coordinates and cell count
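The four checks above might look roughly like the following base R sketch. It operates on a plain PCA matrix (for a SingleCellExperiment this would be the matrix returned by reducedDim(sce_object, "PCA")), and the helper is_valid_pca() and its arguments are assumptions made for illustration, not the function's actual code.

```r
# Hypothetical validator operating on a plain PCA coordinate matrix.
is_valid_pca <- function(pca, assay_genes, n_cells) {
  # 1. PCA must be present
  if (is.null(pca)) return(FALSE)

  # 2. Rotation matrix and percentVar attributes must exist
  rotation <- attr(pca, "rotation")
  percent_var <- attr(pca, "percentVar")
  if (is.null(rotation) || is.null(percent_var)) return(FALSE)

  # 3. Every gene in the rotation matrix must be present in the assay
  rot_genes <- rownames(rotation)
  if (is.null(rot_genes) || !all(rot_genes %in% assay_genes)) return(FALSE)

  # 4. One row of coordinates per cell
  nrow(pca) == n_cells
}

# Toy PCA: 5 cells, 2 components, rotation over 3 genes
pca <- matrix(0, nrow = 5, ncol = 2)
attr(pca, "rotation") <- matrix(
  0, nrow = 3, ncol = 2, dimnames = list(c("g1", "g2", "g3"), NULL)
)
attr(pca, "percentVar") <- c(60, 25)

is_valid_pca(pca, assay_genes = c("g1", "g2", "g3", "g4"), n_cells = 5)  # TRUE
is_valid_pca(pca, assay_genes = c("g1", "g2"), n_cells = 5)              # FALSE
```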
This function requires the scran and scater packages for HVG selection and PCA computation. These packages should be installed via BiocManager::install(c("scran", "scater")).
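Because these are soft dependencies, it can help to check for them before calling the function. The sketch below uses base R's requireNamespace(); the helper missing_packages() is a hypothetical name introduced for this example.

```r
# Hypothetical helper: report which of the given packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(c("scran", "scater"))
if (length(missing) > 0) {
  # Build the install command for only the packages that are absent
  message(
    "Install with: BiocManager::install(c(",
    paste0('"', missing, '"', collapse = ", "),
    "))"
  )
}
```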
Objects with existing valid PCA are returned unchanged to preserve expensive pre-computations. Only datasets requiring PCA computation are subject to downsampling.
# Load and prepare dataset
library(TENxPBMCData)
library(scuttle)
pbmc_data <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
pbmc_subset <- pbmc_data[, 1:500]
pbmc_subset <- logNormCounts(pbmc_subset)
# Remove any existing PCA
reducedDims(pbmc_subset) <- list()
# Process dataset - will compute PCA using HVGs
processed_data <- processPCA(sce_object = pbmc_subset, n_hvgs = 1000)
#> Data missing PCA - computing...
#> Computing PCA...
#> Using 1000 highly variable genes for PCA computation
# Check results
"PCA" %in% reducedDimNames(processed_data) # Should be TRUE
#> [1] TRUE
ncol(processed_data) # Should be 500 (unchanged)
#> [1] 500