COVID-19 PBMC Processing

Overview

This page describes the data processing workflow for the COVID-19 PBMC dataset. The raw data are downloaded from CZI CELLxGENE and pass through two processing steps that produce the reference and query datasets used in the analysis.

Data Source

The raw dataset can be obtained from CZI CELLxGENE:

Collection: COVID-19 PBMC Single-Cell Data

Format: H5AD file (covid_data.h5ad)

The dataset contains peripheral blood mononuclear cells (PBMCs) from both healthy controls and severe COVID-19 patients, profiled using 10x Genomics scRNA-seq.

Processing Pipeline

The workflow runs two R scripts in sequence:

Step 1: Data Cleaning and Subsetting

File: R/covid/Data_Cleaning_and_Metadata_Removal.R

This script performs initial data cleaning and filtering:

Load raw H5AD file: Uses zellkonverter to read the H5AD file in memory-efficient mode
Remove pre-computed metadata: Strips unneeded metadata, neighbor graphs, PCA results, and other slots to create a clean object
Filter for relevant samples: Retains healthy controls and severe COVID-19 patients, and drops everything else:
- Kept: healthy controls (normal disease status)
- Kept: COVID-19 patients (severe infection)
- Excluded: other disease states and LPS-treated controls
Stratified sampling: Downsamples cells within each sample to a target size:
- ~1,000 cells per healthy control sample (reference)
- ~500 cells per COVID-19 sample (query)
Convert to sparse matrix: Realizes the delayed matrix as an in-memory sparse matrix

Output: data/covid/covid_data_clean.rds (~0.3 GB compressed)
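
The per-sample downsampling in Step 1 can be illustrated with a small base-R sketch. This is not the code from Data_Cleaning_and_Metadata_Removal.R: the helper sample_per_group(), the toy sample IDs, and the choice to keep every cell of a sample that is already below the cap are assumptions made for illustration.

```r
# Hypothetical helper: draw at most `n_max` cell indices from each sample.
sample_per_group <- function(sample_id, n_max, seed = 1) {
  set.seed(seed)
  idx_by_sample <- split(seq_along(sample_id), sample_id)
  keep <- lapply(idx_by_sample, function(idx) {
    # Samples at or below the cap are kept whole (an assumption).
    if (length(idx) <= n_max) idx else idx[sample.int(length(idx), n_max)]
  })
  sort(unlist(keep, use.names = FALSE))
}

# Toy metadata: two healthy-control samples and one COVID-19 sample.
sample_id <- rep(c("HC1", "HC2", "COV1"), times = c(1500, 800, 700))
status    <- ifelse(grepl("^HC", sample_id), "normal", "COVID-19")

# Different caps for reference (healthy) and query (COVID-19) cells.
is_ref     <- status == "normal"
keep_ref   <- which(is_ref)[sample_per_group(sample_id[is_ref], n_max = 1000)]
keep_query <- which(!is_ref)[sample_per_group(sample_id[!is_ref], n_max = 500)]
keep       <- sort(c(keep_ref, keep_query))

# Number of retained cells per sample.
table(sample_id[keep])
```

With a SingleCellExperiment object, an index vector like keep would typically be used to subset columns, for example sce[, keep].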

Step 2: Quality Control, Processing, and Splitting

File: R/covid/Data_QC_and_Processing.R

This script applies QC filters, normalizes data, and prepares reference/query splits:

QC filtering:
- Cells: Keep cells with ≥1,000 total UMI counts
- Genes: Keep genes expressed in ≥10 cells
Assay naming and pseudo-count generation:
- Rename the primary assay to logcounts (the data are already log-normalized)
- Generate a counts assay by reversing the log-transformation: counts = 2^logcounts - 1 (this assumes the stored values are log2(x + 1))
Gene ID conversion:
- Convert Ensembl gene IDs to HGNC gene symbols using the convertEnsemblToSymbols() function, which queries biomaRt
- Handle missing symbols and duplicates by making gene names unique
Dataset splitting:
- Reference: Healthy control cells
- Query: COVID-19 patient cells
Diagnostic-aware PCA:
- Uses the addReferencePCA() function to identify highly variable genes (HVGs) in both reference and query
- Takes the union of HVGs to capture both stable and disease-specific signals
- Runs PCA on this shared feature space (50 components)
Cell type merging:
- Uses the addMergedCellTypes() function to collapse fine-grained author-provided annotations into broader categories
- Enables comparison across different annotation granularities
- Applies a COVID-specific mapping that groups cell types such as CD14 monocytes, CD16 monocytes, CD4 T cells, and CD8 T cells
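
The first three steps in the list above act directly on the expression matrix, and a base-R sketch may make them concrete. It is not taken from Data_QC_and_Processing.R: the helper qc_and_pseudocounts(), the total_umi vector, the toy matrix, the order of the two filters, and the fallback to the Ensembl ID when a symbol is missing are all assumptions. The inversion is only valid if the stored values really are log2(x + 1).

```r
# Hypothetical helper; thresholds default to the values quoted in the list.
qc_and_pseudocounts <- function(logcounts, total_umi,
                                min_umi = 1000, min_cells = 10) {
  # Cell filter: keep cells with enough total UMI counts.
  keep_cells <- total_umi >= min_umi
  logcounts  <- logcounts[, keep_cells, drop = FALSE]
  # Gene filter: keep genes detected in enough of the remaining cells.
  keep_genes <- rowSums(logcounts > 0) >= min_cells
  logcounts  <- logcounts[keep_genes, , drop = FALSE]
  # Invert log2(x + 1) to obtain pseudo-counts.
  list(counts = 2^logcounts - 1, logcounts = logcounts)
}

# Deterministic toy data: 20 genes x 30 cells.
raw <- matrix(100, nrow = 20, ncol = 30,
              dimnames = list(paste0("gene", 1:20), paste0("cell", 1:30)))
raw[, 1:5]   <- 10   # five low-depth cells
raw[1, ]     <- 0    # a gene that is never detected
raw[2, 1:25] <- 0    # a gene detected in only five cells

total_umi <- colSums(raw)
logcounts <- log2(raw + 1)
res <- qc_and_pseudocounts(logcounts, total_umi)
dim(res$counts)

# Gene names: fall back to the ID when the symbol is missing, then deduplicate.
ensembl   <- c("ENSG_A", "ENSG_B", "ENSG_C")  # placeholder IDs, not real ones
symbols   <- c("CD14", NA, "CD14")
no_symbol <- is.na(symbols) | symbols == ""
symbols[no_symbol] <- ensembl[no_symbol]
gene_names <- make.unique(symbols)
```

On real data the matrix would usually be sparse, in which case the same row and column sums are available through the Matrix package.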

Output:

data/covid/normal_data_sce.rds — Reference (healthy) dataset
data/covid/covid_data_sce.rds — Query (COVID-19) dataset
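
The idea of a PCA on the union of reference and query HVGs, which underlies these two files, can be illustrated with a base-R sketch. It is not the addReferencePCA() implementation: top_variable() is a hypothetical stand-in that ranks genes by plain variance, the data are simulated, 10 components are used instead of 50 to suit the toy size, and fitting the PCA on the reference and then projecting the query is an assumption about how the shared space is used.

```r
# Hypothetical stand-in for HVG selection: top-n genes by variance.
top_variable <- function(logcounts, n) {
  v <- apply(logcounts, 1, var)
  names(sort(v, decreasing = TRUE))[seq_len(min(n, length(v)))]
}

set.seed(1)
genes <- paste0("gene", 1:200)
ref   <- matrix(rnorm(200 * 60), nrow = 200,
                dimnames = list(genes, paste0("ref", 1:60)))
query <- matrix(rnorm(200 * 40), nrow = 200,
                dimnames = list(genes, paste0("qry", 1:40)))
# Inflate ten genes in the query only, mimicking a disease-specific signal.
query[1:10, ] <- query[1:10, ] * 5

# Union of the two HVG sets: stable genes plus query-specific genes.
hvg_union <- union(top_variable(ref, 50), top_variable(query, 50))

# PCA on the reference, restricted to the shared feature set.
pca     <- prcomp(t(ref[hvg_union, ]), center = TRUE, scale. = FALSE, rank. = 10)
ref_pcs <- pca$x

# Project the query cells into the same space.
query_pcs <- predict(pca, newdata = t(query[hvg_union, ]))
```

Taking the union rather than the reference HVGs alone means that genes that vary only in the query still contribute to the feature space, which is the motivation given in the list above.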

Final Datasets

Both output objects are in SingleCellExperiment format with the following components:

Assays: counts (pseudo-counts) and logcounts (log-normalized)
Metadata: Cell-level annotations including:
- Disease status (normal vs. COVID-19)
- Sample ID, cell type, and other clinical metadata
Reduced dimensions: PCA (50 components) computed on selected HVGs
Row data: Gene symbols and metadata

Dimensions:

Reference: 23,201 cells; 18,472 genes
Query: 48,148 cells; 18,472 genes

Downloading Pre-processed Data

Running this pipeline from scratch is slow. The fully processed datasets can instead be downloaded directly by following the instructions on the Data retrieval page.

Full Scripts

For complete implementation details, see:

Step 1: R/covid/Data_Cleaning_and_Metadata_Removal.R
Step 2: R/covid/Data_QC_and_Processing.R

Supporting functions:

R/auxiliary/convertEnsemblToSymbols.R — Gene ID conversion via biomaRt
R/auxiliary/addReferencePCA.R — Diagnostic-aware PCA using union of reference and query HVGs
R/auxiliary/addMergedCellTypes.R — Cell type categorization with COVID-specific mappings
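
The kind of label collapsing that addMergedCellTypes.R is described as performing can be sketched with a named lookup vector. The helper merge_cell_types(), the fine-grained labels, and the rule that unmapped labels are left unchanged are hypothetical; the actual mapping is defined in the script itself.

```r
# Hypothetical fine-to-broad mapping (names are fine labels, values are broad).
merge_map <- c(
  "naive CD4 T cell"       = "CD4 T cells",
  "memory CD4 T cell"      = "CD4 T cells",
  "naive CD8 T cell"       = "CD8 T cells",
  "classical monocyte"     = "CD14 monocytes",
  "non-classical monocyte" = "CD16 monocytes"
)

# Hypothetical helper: map labels and keep unmapped ones as they are.
merge_cell_types <- function(labels, map) {
  merged   <- unname(map[labels])
  unmapped <- is.na(merged)
  merged[unmapped] <- labels[unmapped]
  merged
}

fine  <- c("naive CD4 T cell", "memory CD4 T cell",
           "classical monocyte", "platelet")
broad <- merge_cell_types(fine, merge_map)
```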