Bootcamp Part 2: Starting with Data

Dataframes

We will begin with a mostly processed dataset from NHANES designed to examine the relationship between diabetes and dental caries in adolescents. Specifically, this dataset contains the 3346 adolescents recorded in NHANES from 2005 to 2010 with non-missing dental decay data.

We can load the data into R as a dataframe using the read.csv function.

nhanes <- read.csv("https://raw.githubusercontent.com/ccb-hms/hsdm-r-course/main/session-materials/session0/session0Data.csv")

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name:

nhanes

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():

head(nhanes)
  sequence.id age.years    sex                      ethnicity
1       31521        13 Female             Non-Hispanic White
2       31981        15 Female             Non-Hispanic Black
3       32326        16   Male             Non-Hispanic Black
4       32369        15   Male Other Race - Including Multi-R
5       32537        13 Female               Mexican American
6       32604        15 Female             Non-Hispanic White
                     birthplace family.PIR dental.decay.present
1 Born in 50 US States or Washi       5.00                FALSE
2 Born in 50 US States or Washi       0.67                FALSE
3 Born in 50 US States or Washi       1.24                FALSE
4 Born in 50 US States or Washi       0.93                FALSE
5 Born in 50 US States or Washi       4.13                FALSE
6 Born in 50 US States or Washi       5.00                FALSE
  dental.restoration.present plasma.glucose hba1c   bmi
1                      FALSE             88   5.1 17.62
2                      FALSE             94   4.7 20.65
3                      FALSE             NA   6.2 35.62
4                       TRUE             NA   5.3 27.72
5                      FALSE             NA   5.1 19.43
6                      FALSE             NA   5.0 21.22

What are data frames?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).

We can see this when inspecting the structure of a data frame with the function str():

str(nhanes)
'data.frame':   3346 obs. of  11 variables:
 $ sequence.id               : int  31521 31981 32326 32369 32537 32604 31129 31148 31347 31458 ...
 $ age.years                 : int  13 15 16 15 13 15 15 16 16 13 ...
 $ sex                       : chr  "Female" "Female" "Male" "Male" ...
 $ ethnicity                 : chr  "Non-Hispanic White" "Non-Hispanic Black" "Non-Hispanic Black" "Other Race - Including Multi-R" ...
 $ birthplace                : chr  "Born in 50 US States or Washi" "Born in 50 US States or Washi" "Born in 50 US States or Washi" "Born in 50 US States or Washi" ...
 $ family.PIR                : num  5 0.67 1.24 0.93 4.13 5 2.71 1.06 3.1 3.76 ...
 $ dental.decay.present      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ dental.restoration.present: logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ plasma.glucose            : int  88 94 NA NA NA NA NA 91 NA NA ...
 $ hba1c                     : num  5.1 4.7 6.2 5.3 5.1 5 5.2 5 5.1 5.1 ...
 $ bmi                       : num  17.6 20.6 35.6 27.7 19.4 ...

Inspecting data.frame Objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

Size:

  • dim(nhanes) - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object).
  • nrow(nhanes) - returns the number of rows.
  • ncol(nhanes) - returns the number of columns.

Content:

  • head(nhanes) - shows the first 6 rows.
  • tail(nhanes) - shows the last 6 rows.

Names:

  • names(nhanes) - returns the column names (synonym of colnames() for data.frame objects).
  • rownames(nhanes) - returns the row names.

Summary:

  • str(nhanes) - structure of the object and information about the class, length and content of each column.
  • summary(nhanes) - summary statistics for each column.

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Packages

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.

Some packages can greatly facilitate our task when we manipulate data. Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; Loading packages can give you access to other specific functions. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.

  • The package dplyr provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimised.

  • As we will see latter on, sometimes we want a data frame to be reshaped to be able to do some specific analyses or for visualisation. The package tidyr addresses this common problem of reshaping data and provides tools for manipulating data in a tidy way.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

  • The tidyverse package is an “umbrella-package” that installs several useful packages for data analysis which work well together, such as tidyr, dplyr, ggplot2, tibble, etc. These packages help us to work and interact with the data. They allow us to do many things with your data, such as subsetting, transforming, visualising, etc.

Packages can be installed using the install.packages command. This downloads and installs the package into your entire R installation. This means that you would not need to re-install the package for a new project.

install.packages("tidyverse")

Once a package is installed, it can be loaded using the library function. This tells R to use all of the functions inside the loaded package. library loads packages into your R session, and thus this line needs to be run each time you open R.

## load the tidyverse packages, incl. dplyr
library("tidyverse")
library("DT")

Loading data with tidyverse

Instead of read.csv(), we will read in our data using the read_csv() function (notice the _ instead of the .), from the tidyverse package readr.

nhanes <- read_csv("https://raw.githubusercontent.com/ccb-hms/hsdm-r-course/main/session-materials/session0/session0Data.csv")

## view the data
datatable(nhanes)

Notice that the class of the data is now referred to as a “tibble”.

Tibbles tweak some of the behaviors of the data frame objects we introduced in the previously. The data structure is very similar to a data frame. For our purposes the only differences are that:

  1. It displays the data type of each column under its name. Note that <dbl> is a data type defined to hold numeric values with decimal points.

  2. It only prints the first few rows of data and only as many columns as fit on one screen.

We are now going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarise(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (nhanes), and the subsequent arguments are the columns to keep.

select(nhanes, sex, age.years, dental.decay.present, hba1c)
# A tibble: 3,346 × 4
   sex    age.years dental.decay.present hba1c
   <chr>      <dbl> <lgl>                <dbl>
 1 Female        13 FALSE                  5.1
 2 Female        15 FALSE                  4.7
 3 Male          16 FALSE                  6.2
 4 Male          15 FALSE                  5.3
 5 Female        13 FALSE                  5.1
 6 Female        15 FALSE                  5  
 7 Male          15 FALSE                  5.2
 8 Female        16 FALSE                  5  
 9 Female        16 FALSE                  5.1
10 Male          13 FALSE                  5.1
# ℹ 3,336 more rows

To select all columns except certain ones, put a “-” in front of the variable to exclude it.

select(nhanes, -sequence.id, -family.PIR)
# A tibble: 3,346 × 9
   age.years sex    ethnicity                    birthplace dental.decay.present
       <dbl> <chr>  <chr>                        <chr>      <lgl>               
 1        13 Female Non-Hispanic White           Born in 5… FALSE               
 2        15 Female Non-Hispanic Black           Born in 5… FALSE               
 3        16 Male   Non-Hispanic Black           Born in 5… FALSE               
 4        15 Male   Other Race - Including Mult… Born in 5… FALSE               
 5        13 Female Mexican American             Born in 5… FALSE               
 6        15 Female Non-Hispanic White           Born in 5… FALSE               
 7        15 Male   Non-Hispanic Black           Born in 5… FALSE               
 8        16 Female Non-Hispanic White           Born in 5… FALSE               
 9        16 Female Non-Hispanic White           Born in 5… FALSE               
10        13 Male   Mexican American             Born in 5… FALSE               
# ℹ 3,336 more rows
# ℹ 4 more variables: dental.restoration.present <lgl>, plasma.glucose <dbl>,
#   hba1c <dbl>, bmi <dbl>

This will select all the variables in nhanes except sequence.id and family.PIR.

To choose rows based on a specific criteria, use filter():

filter(nhanes, sex == "Male")
# A tibble: 1,703 × 11
   sequence.id age.years sex   ethnicity                   birthplace family.PIR
         <dbl>     <dbl> <chr> <chr>                       <chr>           <dbl>
 1       32326        16 Male  Non-Hispanic Black          Born in 5…       1.24
 2       32369        15 Male  Other Race - Including Mul… Born in 5…       0.93
 3       31129        15 Male  Non-Hispanic Black          Born in 5…       2.71
 4       31458        13 Male  Mexican American            Born in 5…       3.76
 5       31763        18 Male  Mexican American            Born in 5…       3.1 
 6       31848        16 Male  Non-Hispanic White          Born in 5…       5   
 7       31882        16 Male  Non-Hispanic White          Born in 5…       1.82
 8       32758        13 Male  Non-Hispanic Black          Born in 5…       1.72
 9       32831        15 Male  Mexican American            Born in 5…       1.03
10       32840        13 Male  Non-Hispanic White          Born in 5…       5   
# ℹ 1,693 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>
filter(nhanes, sex == "Male" & plasma.glucose > 80)
# A tibble: 753 × 11
   sequence.id age.years sex   ethnicity          birthplace          family.PIR
         <dbl>     <dbl> <chr> <chr>              <chr>                    <dbl>
 1       31882        16 Male  Non-Hispanic White Born in 50 US Stat…       1.82
 2       32758        13 Male  Non-Hispanic Black Born in 50 US Stat…       1.72
 3       32831        15 Male  Mexican American   Born in 50 US Stat…       1.03
 4       33165        18 Male  Non-Hispanic Black Born in 50 US Stat…       0.73
 5       33215        17 Male  Mexican American   Born in 50 US Stat…      NA   
 6       31698        14 Male  Non-Hispanic Black Born in 50 US Stat…       4.55
 7       32275        13 Male  Mexican American   Born in 50 US Stat…       3.76
 8       33870        17 Male  Non-Hispanic Black Born in 50 US Stat…      NA   
 9       34097        14 Male  Non-Hispanic Black Born in 50 US Stat…       1.12
10       32261        13 Male  Non-Hispanic Black Born in 50 US Stat…       1.2 
# ℹ 743 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>
filter(nhanes, sex == "Male" & !is.na(hba1c))
# A tibble: 1,561 × 11
   sequence.id age.years sex   ethnicity                   birthplace family.PIR
         <dbl>     <dbl> <chr> <chr>                       <chr>           <dbl>
 1       32326        16 Male  Non-Hispanic Black          Born in 5…       1.24
 2       32369        15 Male  Other Race - Including Mul… Born in 5…       0.93
 3       31129        15 Male  Non-Hispanic Black          Born in 5…       2.71
 4       31458        13 Male  Mexican American            Born in 5…       3.76
 5       31763        18 Male  Mexican American            Born in 5…       3.1 
 6       31848        16 Male  Non-Hispanic White          Born in 5…       5   
 7       31882        16 Male  Non-Hispanic White          Born in 5…       1.82
 8       32758        13 Male  Non-Hispanic Black          Born in 5…       1.72
 9       32831        15 Male  Mexican American            Born in 5…       1.03
10       32840        13 Male  Non-Hispanic White          Born in 5…       5   
# ℹ 1,551 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

nhanes2 <- filter(nhanes, sex == "Male")
nhanes3 <- select(nhanes2, sex, age.years, dental.decay.present, hba1c)
nhanes3
# A tibble: 1,703 × 4
   sex   age.years dental.decay.present hba1c
   <chr>     <dbl> <lgl>                <dbl>
 1 Male         16 FALSE                  6.2
 2 Male         15 FALSE                  5.3
 3 Male         15 FALSE                  5.2
 4 Male         13 FALSE                  5.1
 5 Male         18 FALSE                  5.2
 6 Male         16 FALSE                  5  
 7 Male         16 FALSE                  4.5
 8 Male         13 FALSE                  5.4
 9 Male         15 FALSE                  4.5
10 Male         13 FALSE                  5.1
# ℹ 1,693 more rows

This is readable, but can clutter up your workspace with lots of intermediate objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

nhanes3 <- select(filter(nhanes, sex == "Male"), sex, age.years, dental.decay.present, hba1c)
nhanes3
# A tibble: 1,703 × 4
   sex   age.years dental.decay.present hba1c
   <chr>     <dbl> <lgl>                <dbl>
 1 Male         16 FALSE                  6.2
 2 Male         15 FALSE                  5.3
 3 Male         15 FALSE                  5.2
 4 Male         13 FALSE                  5.1
 5 Male         18 FALSE                  5.2
 6 Male         16 FALSE                  5  
 7 Male         16 FALSE                  4.5
 8 Male         13 FALSE                  5.4
 9 Male         15 FALSE                  4.5
10 Male         13 FALSE                  5.1
# ℹ 1,693 more rows

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

Pipes in R look like %>% (made available via the magrittr package) or |> (through base R). If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

In the above code, we use the pipe to send the nhanes dataset first through filter() to keep rows where sex is Male, then through select() to keep only the selected columns.

The pipe %>% takes the object on its left and passes it directly as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

nhanes %>%
  filter(sex == "Male") %>%
  select(sex, age.years, dental.decay.present, hba1c)
# A tibble: 1,703 × 4
   sex   age.years dental.decay.present hba1c
   <chr>     <dbl> <lgl>                <dbl>
 1 Male         16 FALSE                  6.2
 2 Male         15 FALSE                  5.3
 3 Male         15 FALSE                  5.2
 4 Male         13 FALSE                  5.1
 5 Male         18 FALSE                  5.2
 6 Male         16 FALSE                  5  
 7 Male         16 FALSE                  4.5
 8 Male         13 FALSE                  5.4
 9 Male         15 FALSE                  4.5
10 Male         13 FALSE                  5.1
# ℹ 1,693 more rows

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame rna, then we filtered for rows with sex == "Male", then we selected columns sex, age.years, dental.decay.present, and hba1c.

The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

nhanes3 %>%
  filter(sex == "Male") %>%
  select(sex, age.years, dental.decay.present, hba1c)
# A tibble: 1,703 × 4
   sex   age.years dental.decay.present hba1c
   <chr>     <dbl> <lgl>                <dbl>
 1 Male         16 FALSE                  6.2
 2 Male         15 FALSE                  5.3
 3 Male         15 FALSE                  5.2
 4 Male         13 FALSE                  5.1
 5 Male         18 FALSE                  5.2
 6 Male         16 FALSE                  5  
 7 Male         16 FALSE                  4.5
 8 Male         13 FALSE                  5.4
 9 Male         15 FALSE                  4.5
10 Male         13 FALSE                  5.1
# ℹ 1,693 more rows
nhanes3
# A tibble: 1,703 × 4
   sex   age.years dental.decay.present hba1c
   <chr>     <dbl> <lgl>                <dbl>
 1 Male         16 FALSE                  6.2
 2 Male         15 FALSE                  5.3
 3 Male         15 FALSE                  5.2
 4 Male         13 FALSE                  5.1
 5 Male         18 FALSE                  5.2
 6 Male         16 FALSE                  5  
 7 Male         16 FALSE                  4.5
 8 Male         13 FALSE                  5.4
 9 Male         15 FALSE                  4.5
10 Male         13 FALSE                  5.1
# ℹ 1,693 more rows
Challenge:

Using pipes, subset the nhanes data to keep female participants 15 years or older, where the hba1c is greater than 5.2 (and is not NA), and retain only the columns sex, age.years, and plasma.glucose.

nhanes %>%
  filter(hba1c > 5.2,
         sex == "Female",
         age.years >= 15 ) %>%
  select(sex, age.years, plasma.glucose)
# A tibble: 319 × 3
   sex    age.years plasma.glucose
   <chr>      <dbl>          <dbl>
 1 Female        16            132
 2 Female        16             NA
 3 Female        17             90
 4 Female        15             90
 5 Female        15             83
 6 Female        18             92
 7 Female        17             NA
 8 Female        17             NA
 9 Female        17             79
10 Female        18             93
# ℹ 309 more rows

Mutate

Frequently you’ll want to create new columns based on the values of existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of age in months:

nhanes %>%
  mutate(age.months = age.years * 12) %>%
  select(age.months, age.years)
# A tibble: 3,346 × 2
   age.months age.years
        <dbl>     <dbl>
 1        156        13
 2        180        15
 3        192        16
 4        180        15
 5        156        13
 6        180        15
 7        180        15
 8        192        16
 9        192        16
10        156        13
# ℹ 3,336 more rows

You can also create a second new column based on the first new column within the same call of mutate():

nhanes %>%
  mutate(age.months = age.years * 12,
         lived_200_months = age.months >= 200) %>%
  select(age.months, age.years, lived_200_months)
# A tibble: 3,346 × 3
   age.months age.years lived_200_months
        <dbl>     <dbl> <lgl>           
 1        156        13 FALSE           
 2        180        15 FALSE           
 3        192        16 FALSE           
 4        180        15 FALSE           
 5        156        13 FALSE           
 6        180        15 FALSE           
 7        180        15 FALSE           
 8        192        16 FALSE           
 9        192        16 FALSE           
10        156        13 FALSE           
# ℹ 3,336 more rows
Challenge

Create a new data frame from the nhanes data that meets the following criteria: contains the columns sex, under_14, ethnicity, and family.PIR columns. The under_14 column should be a logical variable indicating whether or not the age of the participant is less than 14. This data frame must only contain participants born in the USA or Mexico, have a non-missing plasma glucose value, and with a BMI less than 30.

nhanes %>%
  mutate(under_14 = age.years < 14) %>%
  filter(birthplace == 'Born in 50 US States or Washi' | birthplace == "Born in Mexico") %>%
  filter(!is.na(plasma.glucose)) %>%
  filter(bmi < 30) %>%
  select(sex, under_14, ethnicity, family.PIR)
# A tibble: 1,147 × 4
   sex    under_14 ethnicity          family.PIR
   <chr>  <lgl>    <chr>                   <dbl>
 1 Female TRUE     Non-Hispanic White       5   
 2 Female FALSE    Non-Hispanic Black       0.67
 3 Female FALSE    Non-Hispanic White       1.06
 4 Male   FALSE    Non-Hispanic White       1.82
 5 Male   TRUE     Non-Hispanic Black       1.72
 6 Male   FALSE    Mexican American         1.03
 7 Male   FALSE    Non-Hispanic Black       0.73
 8 Male   FALSE    Mexican American        NA   
 9 Female FALSE    Non-Hispanic White       1.2 
10 Male   FALSE    Non-Hispanic Black       2.23
# ℹ 1,137 more rows

Split-apply-combine data analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

nhanes %>%
  group_by(birthplace)
# A tibble: 3,346 × 11
# Groups:   birthplace [7]
   sequence.id age.years sex    ethnicity                  birthplace family.PIR
         <dbl>     <dbl> <chr>  <chr>                      <chr>           <dbl>
 1       31521        13 Female Non-Hispanic White         Born in 5…       5   
 2       31981        15 Female Non-Hispanic Black         Born in 5…       0.67
 3       32326        16 Male   Non-Hispanic Black         Born in 5…       1.24
 4       32369        15 Male   Other Race - Including Mu… Born in 5…       0.93
 5       32537        13 Female Mexican American           Born in 5…       4.13
 6       32604        15 Female Non-Hispanic White         Born in 5…       5   
 7       31129        15 Male   Non-Hispanic Black         Born in 5…       2.71
 8       31148        16 Female Non-Hispanic White         Born in 5…       1.06
 9       31347        16 Female Non-Hispanic White         Born in 5…       3.1 
10       31458        13 Male   Mexican American           Born in 5…       3.76
# ℹ 3,336 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>

The group_by() function doesn’t perform any data processing, it groups the data into subsets: in the example above, our initial tibble of 3346 observations is split into 7 groups based on the birthplace variable.

We could similarly decide to group the tibble by sex:

nhanes %>%
  group_by(sex)
# A tibble: 3,346 × 11
# Groups:   sex [2]
   sequence.id age.years sex    ethnicity                  birthplace family.PIR
         <dbl>     <dbl> <chr>  <chr>                      <chr>           <dbl>
 1       31521        13 Female Non-Hispanic White         Born in 5…       5   
 2       31981        15 Female Non-Hispanic Black         Born in 5…       0.67
 3       32326        16 Male   Non-Hispanic Black         Born in 5…       1.24
 4       32369        15 Male   Other Race - Including Mu… Born in 5…       0.93
 5       32537        13 Female Mexican American           Born in 5…       4.13
 6       32604        15 Female Non-Hispanic White         Born in 5…       5   
 7       31129        15 Male   Non-Hispanic Black         Born in 5…       2.71
 8       31148        16 Female Non-Hispanic White         Born in 5…       1.06
 9       31347        16 Female Non-Hispanic White         Born in 5…       3.1 
10       31458        13 Male   Mexican American           Born in 5…       3.76
# ℹ 3,336 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>

Here our initial tibble of 3346 observations is split into 2 groups based on the sex variable.

Once the data has been grouped, subsequent operations will be applied on each group independently.

The summarise() function

group_by() is often used together with summarise(), which collapses each group into a single-row summary of that group.

group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean bmi by birthplace:

nhanes %>%
  group_by(birthplace) %>%
  summarise(mean_bmi = mean(bmi, na.rm = TRUE))
# A tibble: 7 × 2
  birthplace                     mean_bmi
  <chr>                             <dbl>
1 Born Elsewhere                     22.2
2 Born in 50 US States or Washi      24.3
3 Born in Mexico                     24.2
4 Born in Other Non-Spanish Spea     22.5
5 Born in Other Spanish Speaking     24.3
6 Don't Know                         22.9
7 Refused                            23.9

But we can can also group by multiple columns:

nhanes %>%
  group_by(sex, ethnicity) %>%
  summarise(mean_bmi = mean(bmi, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 10 × 3
# Groups:   sex [2]
   sex    ethnicity                      mean_bmi
   <chr>  <chr>                             <dbl>
 1 Female Mexican American                   24.5
 2 Female Non-Hispanic Black                 26.0
 3 Female Non-Hispanic White                 23.3
 4 Female Other Hispanic                     24.5
 5 Female Other Race - Including Multi-R     22.9
 6 Male   Mexican American                   24.9
 7 Male   Non-Hispanic Black                 23.9
 8 Male   Non-Hispanic White                 23.4
 9 Male   Other Hispanic                     24.0
10 Male   Other Race - Including Multi-R     23.8

Once the data is grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the median plasma.glucose by sex and ethnicity:

nhanes %>%
  group_by(sex, ethnicity) %>%
  summarise(mean_plasma_glucose = mean(plasma.glucose),
            median_plasma_glucose = median(plasma.glucose))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 10 × 4
# Groups:   sex [2]
   sex    ethnicity                    mean_plasma_glucose median_plasma_glucose
   <chr>  <chr>                                      <dbl>                 <dbl>
 1 Female Mexican American                              NA                    NA
 2 Female Non-Hispanic Black                            NA                    NA
 3 Female Non-Hispanic White                            NA                    NA
 4 Female Other Hispanic                                NA                    NA
 5 Female Other Race - Including Mult…                  NA                    NA
 6 Male   Mexican American                              NA                    NA
 7 Male   Non-Hispanic Black                            NA                    NA
 8 Male   Non-Hispanic White                            NA                    NA
 9 Male   Other Hispanic                                NA                    NA
10 Male   Other Race - Including Mult…                  NA                    NA
Challenge

Calculate the mean hba1c of participants born in the USA by age.

nhanes %>%
  filter(birthplace == 'Born in 50 US States or Washi') %>%
  group_by(age.years) %>%
  summarise(mean_hba1c = mean(hba1c))
# A tibble: 6 × 2
  age.years mean_hba1c
      <dbl>      <dbl>
1        13         NA
2        14         NA
3        15         NA
4        16         NA
5        17         NA
6        18         NA

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each age, we would do:

nhanes %>%
    count(age.years)
# A tibble: 6 × 2
  age.years     n
      <dbl> <int>
1        13   537
2        14   576
3        15   558
4        16   590
5        17   558
6        18   527

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, nhanes %>% count(age.years) is equivalent to:

nhanes %>%
    group_by(age.years) %>%
    summarise(n = n())
# A tibble: 6 × 2
  age.years     n
      <dbl> <int>
1        13   537
2        14   576
3        15   558
4        16   590
5        17   558
6        18   527

The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., infection). If we wanted to count a combination of factors, such as age and sex, we would specify the first and the second factor as the arguments of count():

nhanes %>%
    count(age.years, sex)
# A tibble: 12 × 3
   age.years sex        n
       <dbl> <chr>  <int>
 1        13 Female   268
 2        13 Male     269
 3        14 Female   290
 4        14 Male     286
 5        15 Female   271
 6        15 Male     287
 7        16 Female   288
 8        16 Male     302
 9        17 Female   263
10        17 Male     295
11        18 Female   263
12        18 Male     264

It is sometimes useful to sort the result to facilitate the comparisons. We can use arrange() to sort the table. For instance, we might want to arrange the table above by age:

nhanes %>%
    count(age.years, sex) %>%
    arrange(age.years)
# A tibble: 12 × 3
   age.years sex        n
       <dbl> <chr>  <int>
 1        13 Female   268
 2        13 Male     269
 3        14 Female   290
 4        14 Male     286
 5        15 Female   271
 6        15 Male     287
 7        16 Female   288
 8        16 Male     302
 9        17 Female   263
10        17 Male     295
11        18 Female   263
12        18 Male     264

or by counts:

nhanes %>%
    count(age.years, sex) %>%
    arrange(n)
# A tibble: 12 × 3
   age.years sex        n
       <dbl> <chr>  <int>
 1        17 Female   263
 2        18 Female   263
 3        18 Male     264
 4        13 Female   268
 5        13 Male     269
 6        15 Female   271
 7        14 Male     286
 8        15 Male     287
 9        16 Female   288
10        14 Female   290
11        17 Male     295
12        16 Male     302

To sort in descending order, we need to add the desc() function:

nhanes %>%
    count(age.years, sex) %>%
    arrange(desc(n))
# A tibble: 12 × 3
   age.years sex        n
       <dbl> <chr>  <int>
 1        16 Male     302
 2        17 Male     295
 3        14 Female   290
 4        16 Female   288
 5        15 Male     287
 6        14 Male     286
 7        15 Female   271
 8        13 Male     269
 9        13 Female   268
10        18 Male     264
11        17 Female   263
12        18 Female   263
Challenge
  1. How many participants have a non-empty plasma glucose value for each age?
  2. Use group_by() and summarise() to evaluate the poverty income ratio (family.PIR) by ethnicity. Which ethnicity has the highest poverty income ratio?
#1
nhanes %>% filter(!is.na(plasma.glucose)) %>%
  group_by(age.years) %>%
  count()
# A tibble: 6 × 2
# Groups:   age.years [6]
  age.years     n
      <dbl> <int>
1        13   227
2        14   232
3        15   259
4        16   258
5        17   240
6        18   241
#2
nhanes %>% group_by(ethnicity) %>%
  summarise(mean_PIR = mean(family.PIR, na.rm = TRUE)) %>%
  arrange(desc(mean_PIR))
# A tibble: 5 × 2
  ethnicity                      mean_PIR
  <chr>                             <dbl>
1 Non-Hispanic White                 2.86
2 Other Race - Including Multi-R     2.35
3 Other Hispanic                     1.96
4 Non-Hispanic Black                 1.90
5 Mexican American                   1.74

Exporting data

Now that you have learned how to use dplyr to extract information from or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

Let’s use write_csv() to save the nhanes data.

write_csv(nhanes, file = "nhanes_processed.csv")

The materials in this lesson have been adapted from work created by the (HBC)](http://bioinformatics.sph.harvard.edu/) and Data Carpentry (http://datacarpentry.org/), as well as materials created by Laurent Gatto, Charlotte Soneson, Jenny Drnevich, Robert Castelo, and Kevin Rue-Albert. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.