Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.
The group_by() function doesn’t perform any data processing, it groups the data into subsets: in the example above, our initial tibble of 32428 observations is split into 1474 groups based on the gene variable.
We could similarly decide to group the tibble by the samples:
Here our initial tibble of 32428 observations is split into 22 groups based on the sample variable.
Once the data has been grouped, subsequent operations will be applied on each group independently.
The summarise() function
group_by() is often used together with summarise(), which collapses each group into a single-row summary of that group.
group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean expression by gene:
Once the data is grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the median expression by gene and by condition:
# A tibble: 3 × 2
time mean
<dbl> <dbl>
1 0 169
2 4 156.
3 8 61
Counting
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each infected and non-infected samples, we would do:
rna %>%count(infection)
# A tibble: 2 × 2
infection n
<chr> <int>
1 InfluenzaA 22110
2 NonInfected 10318
The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, rna %>% count(infection) is equivalent to:
rna %>%group_by(infection) %>%summarise(n =n())
# A tibble: 2 × 2
infection n
<chr> <int>
1 InfluenzaA 22110
2 NonInfected 10318
The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., infection). If we wanted to count a combination of factors, such as infection and time, we would specify the first and the second factor as the arguments of count():
rna %>%count(infection, time)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
`summarise()` has grouped output by 'infection'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups: infection [2]
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
It is sometimes useful to sort the result to facilitate the comparisons. We can use arrange() to sort the table. For instance, we might want to arrange the table above by time:
rna %>%count(infection, time) %>%arrange(time)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 NonInfected 0 10318
2 InfluenzaA 4 11792
3 InfluenzaA 8 10318
or by counts:
rna %>%count(infection, time) %>%arrange(n)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 8 10318
2 NonInfected 0 10318
3 InfluenzaA 4 11792
To sort in descending order, we need to add the desc() function:
rna %>%count(infection, time) %>%arrange(desc(n))
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
Challenge
How many genes were analysed in each sample?
Use group_by() and summarise() to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth?
Pick one sample and evaluate the number of genes by biotype.
Identify genes associated with the “abnormal DNA methylation” phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8.
The materials in this lesson have been adapted from work created by the HBC and Data Carpentry, as well as materials created by Laurent Gatto, Charlotte Soneson, Jenny Drnevich, Robert Castelo, and Kevin Rue-Albert. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.