Accessing data from the Human Cell Atlas (HCA)
Last updated on 2024-09-09 | Edit this page
Overview
Questions
- How to obtain single-cell reference maps from the Human Cell Atlas?
Objectives
- Learn about different resources for public single-cell RNA-seq data.
- Access data from the Human Cell Atlas using the
CuratedAtlasQueryR
package. - Query for cells of interest and download them into a
SingleCellExperiment
object.
HCA Project
The Human Cell Atlas (HCA) is a large project that aims to learn from and map every cell type in the human body. The project extracts spatial and molecular characteristics in order to understand cellular function and networks. It is an international collaborative that charts healthy cells in the human body at all ages. There are about 37.2 trillion cells in the human body. To read more about the project, head over to their website at https://www.humancellatlas.org.
CELLxGENE
CELLxGENE is a database and a suite of tools that help scientists to find, download, explore, analyze, annotate, and publish single cell data. It includes several analytic and visualization tools to help you to discover single cell data patterns. To see the list of tools, browse to https://cellxgene.cziscience.com/.
CELLxGENE | Census
The Census provides efficient computational tooling to access, query, and analyze all single-cell RNA data from CZ CELLxGENE Discover. Using a new access paradigm of cell-based slicing and querying, you can interact with the data through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus accelerating your research by significantly minimizing data harmonization at https://chanzuckerberg.github.io/cellxgene-census/.
The CuratedAtlasQueryR Project
The CuratedAtlasQueryR
is an alternative package that
can also be used to access the CELLxGENE data from R through a tidy API.
The data has also been harmonized, curated, and re-annotated across
studies.
CuratedAtlasQueryR
supports data access and programmatic
exploration of the harmonized atlas. Cells of interest can be selected
based on ontology, tissue of origin, demographics, and disease. For
example, the user can select CD4 T helper cells across healthy and
diseased lymphoid tissue. The data for the selected cells can be
downloaded locally into SingleCellExperiment objects. Pseudo bulk counts
are also available to facilitate large-scale, summary analyses of
transcriptional profiles.
Data Sources in R / Bioconductor
There are a few options to access single cell data with R / Bioconductor.
Package | Target | Description |
---|---|---|
hca | HCA Data Portal API | Project, Sample, and File level HCA data |
cellxgenedp | CellxGene | Human and mouse SC data including HCA |
CuratedAtlasQueryR | CellxGene | fine-grained query capable CELLxGENE data including HCA |
Installation
R
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CuratedAtlasQueryR")
Package load
R
library(CuratedAtlasQueryR)
library(dplyr)
HCA Metadata
The metadata allows the user to get a lay of the land of what is available via the package. In this example, we are using the sample database URL which allows us to get a small and quick subset of the available metadata.
R
metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) |>
collect()
Get a view of the first 10 columns in the metadata with
glimpse
R
metadata |>
select(1:10) |>
glimpse()
OUTPUT
Rows: ??
Columns: 10
Database: DuckDB v0.10.2 [unknown@Linux 6.5.0-1021-azure:R 4.4.0/:memory:]
$ cell_ <chr> "TTATGCTAGGGTGTTG_12", "GCTTGAACATGG…
$ sample_ <chr> "039c558ca1c43dc74c563b58fe0d6289", …
$ cell_type <chr> "mature NK T cell", "mature NK T cel…
$ cell_type_harmonised <chr> "immune_unclassified", "cd8 tem", "i…
$ confidence_class <dbl> 5, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, …
$ cell_annotation_azimuth_l2 <chr> "gdt", "cd8 tem", "cd8 tem", "cd8 te…
$ cell_annotation_blueprint_singler <chr> "cd4 tem", "cd8 tem", "cd8 tcm", "cl…
$ cell_annotation_monaco_singler <chr> "natural killer", "effector memory c…
$ sample_id_db <chr> "0c1d320a7d0cbbc281a535912722d272", …
$ `_sample_name` <chr> "BPH340PrSF_Via___transition zone of…
A note on the piping operator
The vignette materials provided by CuratedAtlasQueryR
show the use of the ‘native’ R pipe (implemented after R version
4.1.0
). For those not familiar with the pipe operator
(|>
), it allows you to chain functions by passing the
left-hand side as the first argument to the function on the right-hand
side. It is used extensively in the tidyverse
dialect of R,
especially within the dplyr
package.
The pipe operator can be read as “and then”. Thankfully, R doesn’t care about whitespace, so it’s common to start a new line after a pipe. Together these points enable users to “chain” complex sequences of commands into readable blocks.
In this example, we start with the built-in mtcars
dataset and then filter to rows where cyl
is not equal to
4, and then compute the mean disp
value by each unique
cyl
value.
R
mtcars |>
filter(cyl != 4) |>
summarise(avg_disp = mean(disp),
.by = cyl)
OUTPUT
cyl avg_disp
1 6 183.3143
2 8 353.1000
This command is equivalent to the following:
R
summarise(filter(mtcars, cyl != 4), mean_disp = mean(disp), .by = cyl)
Summarizing the metadata
For each distinct tissue and dataset combination, count the number of datasets by tissue type.
R
metadata |>
distinct(tissue, dataset_id) |>
count(tissue)
OUTPUT
# A tibble: 33 × 2
tissue n
<chr> <int>
1 adrenal gland 1
2 axilla 1
3 blood 17
4 bone marrow 4
5 caecum 1
6 caudate lobe of liver 1
7 cortex of kidney 7
8 dorsolateral prefrontal cortex 1
9 epithelium of esophagus 1
10 fovea centralis 1
# ℹ 23 more rows
Columns available in the metadata
R
head(names(metadata), 10)
OUTPUT
[1] "cell_" "sample_"
[3] "cell_type" "cell_type_harmonised"
[5] "confidence_class" "cell_annotation_azimuth_l2"
[7] "cell_annotation_blueprint_singler" "cell_annotation_monaco_singler"
[9] "sample_id_db" "_sample_name"
Challenge
Glance over the full list of metadata column names. Do any other metadata columns jump out as interesting to you for your work?
R
metadata |> names() |> sort()
Available assays
R
metadata |>
distinct(assay, dataset_id) |>
count(assay)
OUTPUT
# A tibble: 12 × 2
assay n
<chr> <int>
1 10x 3' v1 1
2 10x 3' v2 27
3 10x 3' v3 21
4 10x 5' v1 7
5 10x 5' v2 2
6 Drop-seq 1
7 Seq-Well 2
8 Slide-seq 4
9 Smart-seq2 1
10 Visium Spatial Gene Expression 7
11 scRNA-seq 4
12 sci-RNA-seq 1
Download single-cell RNA sequencing counts
The data can be provided as either “counts” or counts per million
“cpm” as given by the assays
argument in the
get_single_cell_experiment()
function. By default, the
SingleCellExperiment
provided will contain only the
‘counts’ data.
For the sake of demonstration, we’ll focus this small subset of samples:
R
sample_subset = metadata |>
filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
)
Query raw counts
R
single_cell_counts <- sample_subset |>
get_single_cell_experiment()
single_cell_counts
OUTPUT
class: SingleCellExperiment
dim: 36229 1571
metadata(0):
assays(1): counts
rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
colData names(56): sample_ cell_type ... updated_at_y original_cell_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Query counts scaled per million
This is helpful if just few genes are of interest, as they can be compared across samples.
R
sample_subset |>
get_single_cell_experiment(assays = "cpm")
OUTPUT
class: SingleCellExperiment
dim: 36229 1571
metadata(0):
assays(1): cpm
rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
colData names(56): sample_ cell_type ... updated_at_y original_cell_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Extract only a subset of genes
R
single_cell_counts <- sample_subset |>
get_single_cell_experiment(assays = "cpm", features = "PUM1")
single_cell_counts
OUTPUT
class: SingleCellExperiment
dim: 1 1571
metadata(0):
assays(1): cpm
rownames(1): PUM1
rowData names(0):
colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
colData names(56): sample_ cell_type ... updated_at_y original_cell_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Exercises
Exercise 1
Use count
and arrange
to get the number of
cells per tissue in descending order.
R
metadata |>
count(tissue) |>
arrange(-n)
Exercise 2
Use dplyr
-isms to group by tissue
and
cell_type
and get a tally of the highest number of cell
types per tissue combination. What tissue has the most numerous type of
cells?
R
metadata |>
count(tissue, cell_type) |>
arrange(-n)
Exercise 3
Spot some differences between the tissue
and
tissue_harmonised
columns. Use count
to
summarise.
R
metadata |>
count(tissue) |>
arrange(-n)
metadata |>
count(tissue_harmonised) |>
arrange(-n)
To see the full list of curated columns in the metadata, see the
Details section in the ?get_metadata
documentation
page.
Exercise 4
Now that we are a little familiar with navigating the metadata, let’s
obtain a SingleCellExperiment
of 10X scRNA-seq counts of
cd8 tem
lung
cells for females older than
80
with COVID-19
. Note: Use the harmonized
columns, where possible.
R
metadata |>
filter(
sex == "female" &
age_days > 80 * 365 &
stringr::str_like(assay, "%10x%") &
disease == "COVID-19" &
tissue_harmonised == "lung" &
cell_type_harmonised == "cd8 tem"
) |>
get_single_cell_experiment()
OUTPUT
class: SingleCellExperiment
dim: 36229 12
metadata(0):
assays(1): counts
rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(12): TCATCATCATAACCCA_1 TATCTGTCAGAACCGA_1 ...
CCCTTAGCATGACTTG_1 CAGTTCCGTAGCGTAG_1
colData names(56): sample_ cell_type ... updated_at_y original_cell_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Key Points
- The
CuratedAtlasQueryR
package provides programmatic access to single-cell reference maps from the Human Cell Atlas. - The package provides functionality to query for cells of interest
and to download them into a
SingleCellExperiment
object.