Working with summarized experimental data

This section introduces another broadly useful package and data structure, the SummarizedExperiment package and SummarizedExperiment object.

The SummarizedExperiment object has matrix-like properties – it has two dimensions and can be subset by ‘rows’ and ‘columns’. The assay() data of a SummarizedExperiment experiment contains one or more matrix-like objects where rows represent features of interest (e.g., genes), columns represent samples, and elements of the matrix represent results of a genomic assay (e.g., counts of reads overlaps genes in each sample of an bulk RNA-seq differential expression assay.

Object construction

The SummarizedExperiment coordinates assays with (optional) descriptions of rows and columns. We start by reading in a simple data.frame describing 8 samples from an RNASeq experiment looking at dexamethasone treatment across 4 human smooth muscle cell lines; use browseVignettes("airway") for a more complete description of the experiment and data processing. Read the column data in using file.choose() and read.csv().

#fname <- file.choose()  # airway_colData.csv
#fname

We want the first column the the data to be treated as row names (sample identifiers) in the data.frame, so read.csv() has an extra argument to indicate this.

colData <- read.csv(fname, row.names = 1)
head(colData)

           SampleName    cell   dex albut        Run avgLength Experiment
SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
              Sample    BioSample
SRR1039508 SRS508568 SAMN02422669
SRR1039509 SRS508567 SAMN02422675
SRR1039512 SRS508571 SAMN02422678
SRR1039513 SRS508572 SAMN02422670
SRR1039516 SRS508575 SAMN02422682
SRR1039517 SRS508576 SAMN02422673

The data are from the Short Read Archive, and the row names, SampleName, Run, Experiment, Sampel, and BioSample columns are classifications from the archive. Additional columns include:

cell: the cell line used. There are four cell lines.
dex: whether the sample was untreated, or treated with dexamethasone.
albut: a second treatment, which we ignore
avgLength: the sample-specific average length of the RNAseq reads estimated in the experiment.

Assay data

Now import the assay data from the file “airway_counts.csv”

fname <- file.choose()  # airway_counts.csv
fname

counts <- read.csv(fname, row.names=1)

Although the data are read as a data.frame, all columns are of the same type (integer-valued) and represent the same attribute; the data is really a matrix rather than data.frame, so we coerce to matrix using as.matrix().

counts <- as.matrix(counts)

We see the dimensions and first few rows of the counts matrix

dim(counts)

[1] 33469     8

head(counts)

                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000419        467        515        621        365        587
ENSG00000000457        260        211        263        164        245
ENSG00000000460         60         55         40         35         78
ENSG00000000938          0          0          2          0          1
ENSG00000000971       3251       3679       6177       4252       6721
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000419        799        417        508
ENSG00000000457        331        233        229
ENSG00000000460         63         76         60
ENSG00000000938          0          0          0
ENSG00000000971      11027       5176       7995

It’s interesting to think about what the counts mean – for ENSG00000000003, sample SRR1039508 had 679 reads that overlapped this gene, sample SRR1039509 had 448 reads, etc. Notice that for this gene there seems to be a consistent pattern – within a cell line, the read counts in the untreated group are always larger than the read counts for the treated group. This and other basic observations from ‘looking at’ the data motivate many steps in a rigorous RNASeq differential expression analysis.

Creating a SummarizedExperiment object

There is considerable value in tightly coupling of the column data with the assay data, as it reduces the chances of bookkeeping errors as we work with our data.

Attach the SummarizedExperiment library to our R session.

library("SummarizedExperiment")

Warning: package 'GenomeInfoDb' was built under R version 4.3.3

Use the SummarizedExperiment() function to coordinate the assay and column data; this function uses row and column names to make sure the correct assay columns are described by the correct column data rows.

se <- SummarizedExperiment(assay = list(count=counts), colData = colData)
se

class: SummarizedExperiment 
dim: 33469 8 
metadata(0):
assays(1): count
rownames(33469): ENSG00000000003 ENSG00000000419 ... ENSG00000273492
  ENSG00000273493
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

It is straight-forward to use subset() on SummarizedExperiment to create subsets of the data in a coordinated way. Remember that a SummarizedExperiment is conceptually two-dimensional (matrix-like), and in the example below we are subsetting on the second dimension.

subset(se, , dex == "trt")

class: SummarizedExperiment 
dim: 33469 4 
metadata(0):
assays(1): count
rownames(33469): ENSG00000000003 ENSG00000000419 ... ENSG00000273492
  ENSG00000273493
rowData names(0):
colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

There are also accessors that extract data from the SummarizedExperiment. For instance, we can use assay() to extract the count matrix, and colSums() to calculate the library size (total number of reads overlapping genes in each sample).

colSums(assay(se))

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
  20637971   18809481   25348649   15163415   24448408   30818215   19126151 
SRR1039521 
  21164133

Note that library sizes differ by a factor of 2 from largest to smallest; how would this influence the interpretation of counts in individual cells of the assay data?

It might be useful to remember important computations in a way that is robust, e.g.,

se$lib.size <- colSums(assay(se))
colData(se)

DataFrame with 8 rows and 10 columns
            SampleName        cell         dex       albut         Run
           <character> <character> <character> <character> <character>
SRR1039508  GSM1275862      N61311       untrt       untrt  SRR1039508
SRR1039509  GSM1275863      N61311         trt       untrt  SRR1039509
SRR1039512  GSM1275866     N052611       untrt       untrt  SRR1039512
SRR1039513  GSM1275867     N052611         trt       untrt  SRR1039513
SRR1039516  GSM1275870     N080611       untrt       untrt  SRR1039516
SRR1039517  GSM1275871     N080611         trt       untrt  SRR1039517
SRR1039520  GSM1275874     N061011       untrt       untrt  SRR1039520
SRR1039521  GSM1275875     N061011         trt       untrt  SRR1039521
           avgLength  Experiment      Sample    BioSample  lib.size
           <integer> <character> <character>  <character> <numeric>
SRR1039508       126   SRX384345   SRS508568 SAMN02422669  20637971
SRR1039509       126   SRX384346   SRS508567 SAMN02422675  18809481
SRR1039512       126   SRX384349   SRS508571 SAMN02422678  25348649
SRR1039513        87   SRX384350   SRS508572 SAMN02422670  15163415
SRR1039516       120   SRX384353   SRS508575 SAMN02422682  24448408
SRR1039517       126   SRX384354   SRS508576 SAMN02422673  30818215
SRR1039520       101   SRX384357   SRS508579 SAMN02422683  19126151
SRR1039521        98   SRX384358   SRS508580 SAMN02422677  21164133

Exercises

Basic

Subset the SummarizedExperiment object se created above to genes (rows) with at least 8 reads mapped across all samples.

Hint: use rowSums.

Scale the read counts by library size, i.e. divide each column of assay(se) by the corresponding lib.size of each sample (column). Multiply the resulting scaled counts by 10^6 to obtain counts per million reads mapped.

Solution

#1
#Method 1
rs <- rowSums(assay(se))
se.sub <- subset(se, rs>7,)
#Method 2
ind <- rowSums(assay(se)) >= 8
se.sub <- se[ind,]
#2
# scale by library size
# option 1: loop  
for(i in 1:8)
  assay(se.sub)[,i] <- assay(se.sub)[,i] / se$lib.size[i]
assay(se.sub)[1:5,1:5]

                  SRR1039508   SRR1039509   SRR1039512   SRR1039513
ENSG00000000003 3.290052e-05 2.381778e-05 3.443971e-05 2.690687e-05
ENSG00000000419 2.262819e-05 2.737981e-05 2.449835e-05 2.407109e-05
ENSG00000000457 1.259814e-05 1.121775e-05 1.037531e-05 1.081551e-05
ENSG00000000460 2.907263e-06 2.924057e-06 1.577993e-06 2.308187e-06
ENSG00000000971 1.575252e-04 1.955929e-04 2.436816e-04 2.804118e-04
                  SRR1039516
ENSG00000000003 4.654700e-05
ENSG00000000419 2.400974e-05
ENSG00000000457 1.002110e-05
ENSG00000000460 3.190392e-06
ENSG00000000971 2.749054e-04

# option 2: vectorized
se.sub <- se[ind,]
tassay <- t(assay(se.sub)) / se$lib.size
assay(se.sub) <- t(tassay)
assay(se.sub)[1:5,1:5]

                  SRR1039508   SRR1039509   SRR1039512   SRR1039513
ENSG00000000003 3.290052e-05 2.381778e-05 3.443971e-05 2.690687e-05
ENSG00000000419 2.262819e-05 2.737981e-05 2.449835e-05 2.407109e-05
ENSG00000000457 1.259814e-05 1.121775e-05 1.037531e-05 1.081551e-05
ENSG00000000460 2.907263e-06 2.924057e-06 1.577993e-06 2.308187e-06
ENSG00000000971 1.575252e-04 1.955929e-04 2.436816e-04 2.804118e-04
                  SRR1039516
ENSG00000000003 4.654700e-05
ENSG00000000419 2.400974e-05
ENSG00000000457 1.002110e-05
ENSG00000000460 3.190392e-06
ENSG00000000971 2.749054e-04

# counts per million reads mapped
assay(se.sub) <- assay(se.sub) * 10^6
assay(se.sub)[1:5,1:5]

                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003  32.900521  23.817776  34.439705  26.906868  46.546998
ENSG00000000419  22.628193  27.379809  24.498347  24.071095  24.009743
ENSG00000000457  12.598138  11.217747  10.375306  10.815506  10.021102
ENSG00000000460   2.907263   2.924057   1.577993   2.308187   3.190392
ENSG00000000971 157.525175 195.592850 243.681626 280.411767 274.905425

ind <- rowSums(assay(se)) >= 8
se.sub <- se[ind,]
nrow(se.sub)

[1] 23171

Advanced

Carry out a t-test for each of the first 100 genes. Test for differences in mean read count per million reads mapped between the dexamethasone treated and untreated sample group. Annotate the resulting p-values as a new column in the rowData slot of your SummarizedExperiment.

Hint: use apply and t.test.

Solution

se.sub <- se[1:100,]

# logical index of treatment
trt <- se.sub$dex == "trt"

# computing the p-value for a single gene (row)
getPValue <- function(row)
{  
  tt <- t.test(row[trt], row[!trt])  
  p <- tt$p.value
  return(p)
}

# calculate t-test p-values for all genes
ps <- apply(assay(se.sub), 1, getPValue)

# annotate to SE
rowData(se.sub)$pvalue <- ps
rowData(se.sub)

DataFrame with 100 rows and 1 column
                   pvalue
                <numeric>
ENSG00000000003  0.220827
ENSG00000000419  0.827554
ENSG00000000457  0.674747
ENSG00000000460  0.384125
ENSG00000000938  0.215170
...                   ...
ENSG00000005486  0.588909
ENSG00000005513  0.144668
ENSG00000005700  0.559485
ENSG00000005801  0.181433
ENSG00000005810  0.656287

This lesson was adapted from materials created by Ludwig Geistlinger