<- read.csv("https://raw.githubusercontent.com/ccb-hms/hsdm-r-course/main/session-materials/session0/session0Data.csv") nhanes
Bootcamp Part 2: Starting with Data
Dataframes
We will begin with a mostly processed dataset from NHANES designed to examine the relationship between diabetes and dental caries in adolescents. Specifically, this dataset contains the 3346 adolescents recorded in NHANES from 2005 to 2010 with non-missing dental decay data.
We can load the data into R as a dataframe using the read.csv
function.
This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name:
nhanes
Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head()
:
head(nhanes)
sequence.id age.years sex ethnicity
1 31521 13 Female Non-Hispanic White
2 31981 15 Female Non-Hispanic Black
3 32326 16 Male Non-Hispanic Black
4 32369 15 Male Other Race - Including Multi-R
5 32537 13 Female Mexican American
6 32604 15 Female Non-Hispanic White
birthplace family.PIR dental.decay.present
1 Born in 50 US States or Washi 5.00 FALSE
2 Born in 50 US States or Washi 0.67 FALSE
3 Born in 50 US States or Washi 1.24 FALSE
4 Born in 50 US States or Washi 0.93 FALSE
5 Born in 50 US States or Washi 4.13 FALSE
6 Born in 50 US States or Washi 5.00 FALSE
dental.restoration.present plasma.glucose hba1c bmi
1 FALSE 88 5.1 17.62
2 FALSE 94 4.7 20.65
3 FALSE NA 6.2 35.62
4 TRUE NA 5.3 27.72
5 FALSE NA 5.1 19.43
6 FALSE NA 5.0 21.22
What are data frames?
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.
A data frame can be created by hand, but most commonly they are generated by the functions read.csv()
or read.table()
; in other words, when importing spreadsheets from your hard drive (or the web).
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).
We can see this when inspecting the structure of a data frame with the function str()
:
str(nhanes)
'data.frame': 3346 obs. of 11 variables:
$ sequence.id : int 31521 31981 32326 32369 32537 32604 31129 31148 31347 31458 ...
$ age.years : int 13 15 16 15 13 15 15 16 16 13 ...
$ sex : chr "Female" "Female" "Male" "Male" ...
$ ethnicity : chr "Non-Hispanic White" "Non-Hispanic Black" "Non-Hispanic Black" "Other Race - Including Multi-R" ...
$ birthplace : chr "Born in 50 US States or Washi" "Born in 50 US States or Washi" "Born in 50 US States or Washi" "Born in 50 US States or Washi" ...
$ family.PIR : num 5 0.67 1.24 0.93 4.13 5 2.71 1.06 3.1 3.76 ...
$ dental.decay.present : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ dental.restoration.present: logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ plasma.glucose : int 88 94 NA NA NA NA NA 91 NA NA ...
$ hba1c : num 5.1 4.7 6.2 5.3 5.1 5 5.2 5 5.1 5.1 ...
$ bmi : num 17.6 20.6 35.6 27.7 19.4 ...
Inspecting data.frame
Objects
We already saw how the functions head()
and str()
can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
Size:
dim(nhanes)
- returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object).nrow(nhanes)
- returns the number of rows.ncol(nhanes)
- returns the number of columns.
Content:
head(nhanes)
- shows the first 6 rows.tail(nhanes)
- shows the last 6 rows.
Names:
names(nhanes)
- returns the column names (synonym ofcolnames()
fordata.frame
objects).rownames(nhanes)
- returns the row names.
Summary:
str(nhanes)
- structure of the object and information about the class, length and content of each column.summary(nhanes)
- summary statistics for each column.
Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame
.
Packages
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
Some packages can greatly facilitate our task when we manipulate data. Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str()
or data.frame()
, come built into R; Loading packages can give you access to other specific functions. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.
The package
dplyr
provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimised.As we will see latter on, sometimes we want a data frame to be reshaped to be able to do some specific analyses or for visualisation. The package
tidyr
addresses this common problem of reshaping data and provides tools for manipulating data in a tidy way.
To learn more about dplyr
and tidyr
after the workshop, you may want to check out this handy data transformation with dplyr
cheatsheet and this one about tidyr
.
- The
tidyverse
package is an “umbrella-package” that installs several useful packages for data analysis which work well together, such astidyr
,dplyr
,ggplot2
,tibble
, etc. These packages help us to work and interact with the data. They allow us to do many things with your data, such as subsetting, transforming, visualising, etc.
Packages can be installed using the install.packages
command. This downloads and installs the package into your entire R installation. This means that you would not need to re-install the package for a new project.
install.packages("tidyverse")
Once a package is installed, it can be loaded using the library
function. This tells R to use all of the functions inside the loaded package. library
loads packages into your R session, and thus this line needs to be run each time you open R.
## load the tidyverse packages, incl. dplyr
library("tidyverse")
library("DT")
Loading data with tidyverse
Instead of read.csv()
, we will read in our data using the read_csv()
function (notice the _
instead of the .
), from the tidyverse package readr
.
<- read_csv("https://raw.githubusercontent.com/ccb-hms/hsdm-r-course/main/session-materials/session0/session0Data.csv")
nhanes
## view the data
datatable(nhanes)
Notice that the class of the data is now referred to as a “tibble”.
Tibbles tweak some of the behaviors of the data frame objects we introduced in the previously. The data structure is very similar to a data frame. For our purposes the only differences are that:
It displays the data type of each column under its name. Note that <
dbl
> is a data type defined to hold numeric values with decimal points.It only prints the first few rows of data and only as many columns as fit on one screen.
We are now going to learn some of the most common dplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columnsgroup_by()
andsummarise()
: create summary statistics on grouped dataarrange()
: sort resultscount()
: count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select()
. The first argument to this function is the data frame (nhanes
), and the subsequent arguments are the columns to keep.
select(nhanes, sex, age.years, dental.decay.present, hba1c)
# A tibble: 3,346 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Female 13 FALSE 5.1
2 Female 15 FALSE 4.7
3 Male 16 FALSE 6.2
4 Male 15 FALSE 5.3
5 Female 13 FALSE 5.1
6 Female 15 FALSE 5
7 Male 15 FALSE 5.2
8 Female 16 FALSE 5
9 Female 16 FALSE 5.1
10 Male 13 FALSE 5.1
# ℹ 3,336 more rows
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
select(nhanes, -sequence.id, -family.PIR)
# A tibble: 3,346 × 9
age.years sex ethnicity birthplace dental.decay.present
<dbl> <chr> <chr> <chr> <lgl>
1 13 Female Non-Hispanic White Born in 5… FALSE
2 15 Female Non-Hispanic Black Born in 5… FALSE
3 16 Male Non-Hispanic Black Born in 5… FALSE
4 15 Male Other Race - Including Mult… Born in 5… FALSE
5 13 Female Mexican American Born in 5… FALSE
6 15 Female Non-Hispanic White Born in 5… FALSE
7 15 Male Non-Hispanic Black Born in 5… FALSE
8 16 Female Non-Hispanic White Born in 5… FALSE
9 16 Female Non-Hispanic White Born in 5… FALSE
10 13 Male Mexican American Born in 5… FALSE
# ℹ 3,336 more rows
# ℹ 4 more variables: dental.restoration.present <lgl>, plasma.glucose <dbl>,
# hba1c <dbl>, bmi <dbl>
This will select all the variables in nhanes
except sequence.id
and family.PIR
.
To choose rows based on a specific criteria, use filter()
:
filter(nhanes, sex == "Male")
# A tibble: 1,703 × 11
sequence.id age.years sex ethnicity birthplace family.PIR
<dbl> <dbl> <chr> <chr> <chr> <dbl>
1 32326 16 Male Non-Hispanic Black Born in 5… 1.24
2 32369 15 Male Other Race - Including Mul… Born in 5… 0.93
3 31129 15 Male Non-Hispanic Black Born in 5… 2.71
4 31458 13 Male Mexican American Born in 5… 3.76
5 31763 18 Male Mexican American Born in 5… 3.1
6 31848 16 Male Non-Hispanic White Born in 5… 5
7 31882 16 Male Non-Hispanic White Born in 5… 1.82
8 32758 13 Male Non-Hispanic Black Born in 5… 1.72
9 32831 15 Male Mexican American Born in 5… 1.03
10 32840 13 Male Non-Hispanic White Born in 5… 5
# ℹ 1,693 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>
filter(nhanes, sex == "Male" & plasma.glucose > 80)
# A tibble: 753 × 11
sequence.id age.years sex ethnicity birthplace family.PIR
<dbl> <dbl> <chr> <chr> <chr> <dbl>
1 31882 16 Male Non-Hispanic White Born in 50 US Stat… 1.82
2 32758 13 Male Non-Hispanic Black Born in 50 US Stat… 1.72
3 32831 15 Male Mexican American Born in 50 US Stat… 1.03
4 33165 18 Male Non-Hispanic Black Born in 50 US Stat… 0.73
5 33215 17 Male Mexican American Born in 50 US Stat… NA
6 31698 14 Male Non-Hispanic Black Born in 50 US Stat… 4.55
7 32275 13 Male Mexican American Born in 50 US Stat… 3.76
8 33870 17 Male Non-Hispanic Black Born in 50 US Stat… NA
9 34097 14 Male Non-Hispanic Black Born in 50 US Stat… 1.12
10 32261 13 Male Non-Hispanic Black Born in 50 US Stat… 1.2
# ℹ 743 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>
filter(nhanes, sex == "Male" & !is.na(hba1c))
# A tibble: 1,561 × 11
sequence.id age.years sex ethnicity birthplace family.PIR
<dbl> <dbl> <chr> <chr> <chr> <dbl>
1 32326 16 Male Non-Hispanic Black Born in 5… 1.24
2 32369 15 Male Other Race - Including Mul… Born in 5… 0.93
3 31129 15 Male Non-Hispanic Black Born in 5… 2.71
4 31458 13 Male Mexican American Born in 5… 3.76
5 31763 18 Male Mexican American Born in 5… 3.1
6 31848 16 Male Non-Hispanic White Born in 5… 5
7 31882 16 Male Non-Hispanic White Born in 5… 1.82
8 32758 13 Male Non-Hispanic Black Born in 5… 1.72
9 32831 15 Male Mexican American Born in 5… 1.03
10 32840 13 Male Non-Hispanic White Born in 5… 5
# ℹ 1,551 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
<- filter(nhanes, sex == "Male")
nhanes2 <- select(nhanes2, sex, age.years, dental.decay.present, hba1c)
nhanes3 nhanes3
# A tibble: 1,703 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Male 16 FALSE 6.2
2 Male 15 FALSE 5.3
3 Male 15 FALSE 5.2
4 Male 13 FALSE 5.1
5 Male 18 FALSE 5.2
6 Male 16 FALSE 5
7 Male 16 FALSE 4.5
8 Male 13 FALSE 5.4
9 Male 15 FALSE 4.5
10 Male 13 FALSE 5.1
# ℹ 1,693 more rows
This is readable, but can clutter up your workspace with lots of intermediate objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
<- select(filter(nhanes, sex == "Male"), sex, age.years, dental.decay.present, hba1c)
nhanes3 nhanes3
# A tibble: 1,703 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Male 16 FALSE 6.2
2 Male 15 FALSE 5.3
3 Male 15 FALSE 5.2
4 Male 13 FALSE 5.1
5 Male 18 FALSE 5.2
6 Male 16 FALSE 5
7 Male 16 FALSE 4.5
8 Male 13 FALSE 5.4
9 Male 15 FALSE 4.5
10 Male 13 FALSE 5.1
# ℹ 1,693 more rows
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.
Pipes in R look like %>%
(made available via the magrittr
package) or |>
(through base R). If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
In the above code, we use the pipe to send the nhanes
dataset first through filter()
to keep rows where sex
is Male, then through select()
to keep only the selected columns.
The pipe %>%
takes the object on its left and passes it directly as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter()
and select()
functions any more.
%>%
nhanes filter(sex == "Male") %>%
select(sex, age.years, dental.decay.present, hba1c)
# A tibble: 1,703 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Male 16 FALSE 6.2
2 Male 15 FALSE 5.3
3 Male 15 FALSE 5.2
4 Male 13 FALSE 5.1
5 Male 18 FALSE 5.2
6 Male 16 FALSE 5
7 Male 16 FALSE 4.5
8 Male 13 FALSE 5.4
9 Male 15 FALSE 4.5
10 Male 13 FALSE 5.1
# ℹ 1,693 more rows
Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame rna
, then we filter
ed for rows with sex == "Male"
, then we select
ed columns sex
, age.years
, dental.decay.present
, and hba1c
.
The dplyr
functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
%>%
nhanes3 filter(sex == "Male") %>%
select(sex, age.years, dental.decay.present, hba1c)
# A tibble: 1,703 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Male 16 FALSE 6.2
2 Male 15 FALSE 5.3
3 Male 15 FALSE 5.2
4 Male 13 FALSE 5.1
5 Male 18 FALSE 5.2
6 Male 16 FALSE 5
7 Male 16 FALSE 4.5
8 Male 13 FALSE 5.4
9 Male 15 FALSE 4.5
10 Male 13 FALSE 5.1
# ℹ 1,693 more rows
nhanes3
# A tibble: 1,703 × 4
sex age.years dental.decay.present hba1c
<chr> <dbl> <lgl> <dbl>
1 Male 16 FALSE 6.2
2 Male 15 FALSE 5.3
3 Male 15 FALSE 5.2
4 Male 13 FALSE 5.1
5 Male 18 FALSE 5.2
6 Male 16 FALSE 5
7 Male 16 FALSE 4.5
8 Male 13 FALSE 5.4
9 Male 15 FALSE 4.5
10 Male 13 FALSE 5.1
# ℹ 1,693 more rows
Mutate
Frequently you’ll want to create new columns based on the values of existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of age in months:
%>%
nhanes mutate(age.months = age.years * 12) %>%
select(age.months, age.years)
# A tibble: 3,346 × 2
age.months age.years
<dbl> <dbl>
1 156 13
2 180 15
3 192 16
4 180 15
5 156 13
6 180 15
7 180 15
8 192 16
9 192 16
10 156 13
# ℹ 3,336 more rows
You can also create a second new column based on the first new column within the same call of mutate()
:
%>%
nhanes mutate(age.months = age.years * 12,
lived_200_months = age.months >= 200) %>%
select(age.months, age.years, lived_200_months)
# A tibble: 3,346 × 3
age.months age.years lived_200_months
<dbl> <dbl> <lgl>
1 156 13 FALSE
2 180 15 FALSE
3 192 16 FALSE
4 180 15 FALSE
5 156 13 FALSE
6 180 15 FALSE
7 180 15 FALSE
8 192 16 FALSE
9 192 16 FALSE
10 156 13 FALSE
# ℹ 3,336 more rows
Split-apply-combine data analysis
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
%>%
nhanes group_by(birthplace)
# A tibble: 3,346 × 11
# Groups: birthplace [7]
sequence.id age.years sex ethnicity birthplace family.PIR
<dbl> <dbl> <chr> <chr> <chr> <dbl>
1 31521 13 Female Non-Hispanic White Born in 5… 5
2 31981 15 Female Non-Hispanic Black Born in 5… 0.67
3 32326 16 Male Non-Hispanic Black Born in 5… 1.24
4 32369 15 Male Other Race - Including Mu… Born in 5… 0.93
5 32537 13 Female Mexican American Born in 5… 4.13
6 32604 15 Female Non-Hispanic White Born in 5… 5
7 31129 15 Male Non-Hispanic Black Born in 5… 2.71
8 31148 16 Female Non-Hispanic White Born in 5… 1.06
9 31347 16 Female Non-Hispanic White Born in 5… 3.1
10 31458 13 Male Mexican American Born in 5… 3.76
# ℹ 3,336 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>
The group_by()
function doesn’t perform any data processing, it groups the data into subsets: in the example above, our initial tibble
of 3346 observations is split into 7 groups based on the birthplace
variable.
We could similarly decide to group the tibble by sex:
%>%
nhanes group_by(sex)
# A tibble: 3,346 × 11
# Groups: sex [2]
sequence.id age.years sex ethnicity birthplace family.PIR
<dbl> <dbl> <chr> <chr> <chr> <dbl>
1 31521 13 Female Non-Hispanic White Born in 5… 5
2 31981 15 Female Non-Hispanic Black Born in 5… 0.67
3 32326 16 Male Non-Hispanic Black Born in 5… 1.24
4 32369 15 Male Other Race - Including Mu… Born in 5… 0.93
5 32537 13 Female Mexican American Born in 5… 4.13
6 32604 15 Female Non-Hispanic White Born in 5… 5
7 31129 15 Male Non-Hispanic Black Born in 5… 2.71
8 31148 16 Female Non-Hispanic White Born in 5… 1.06
9 31347 16 Female Non-Hispanic White Born in 5… 3.1
10 31458 13 Male Mexican American Born in 5… 3.76
# ℹ 3,336 more rows
# ℹ 5 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>
Here our initial tibble
of 3346 observations is split into 2 groups based on the sex
variable.
Once the data has been grouped, subsequent operations will be applied on each group independently.
The summarise()
function
group_by()
is often used together with summarise()
, which collapses each group into a single-row summary of that group.
group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean bmi
by birthplace:
%>%
nhanes group_by(birthplace) %>%
summarise(mean_bmi = mean(bmi, na.rm = TRUE))
# A tibble: 7 × 2
birthplace mean_bmi
<chr> <dbl>
1 Born Elsewhere 22.2
2 Born in 50 US States or Washi 24.3
3 Born in Mexico 24.2
4 Born in Other Non-Spanish Spea 22.5
5 Born in Other Spanish Speaking 24.3
6 Don't Know 22.9
7 Refused 23.9
But we can can also group by multiple columns:
%>%
nhanes group_by(sex, ethnicity) %>%
summarise(mean_bmi = mean(bmi, na.rm = TRUE))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 10 × 3
# Groups: sex [2]
sex ethnicity mean_bmi
<chr> <chr> <dbl>
1 Female Mexican American 24.5
2 Female Non-Hispanic Black 26.0
3 Female Non-Hispanic White 23.3
4 Female Other Hispanic 24.5
5 Female Other Race - Including Multi-R 22.9
6 Male Mexican American 24.9
7 Male Non-Hispanic Black 23.9
8 Male Non-Hispanic White 23.4
9 Male Other Hispanic 24.0
10 Male Other Race - Including Multi-R 23.8
Once the data is grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the median plasma.glucose
by sex and ethnicity:
%>%
nhanes group_by(sex, ethnicity) %>%
summarise(mean_plasma_glucose = mean(plasma.glucose),
median_plasma_glucose = median(plasma.glucose))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 10 × 4
# Groups: sex [2]
sex ethnicity mean_plasma_glucose median_plasma_glucose
<chr> <chr> <dbl> <dbl>
1 Female Mexican American NA NA
2 Female Non-Hispanic Black NA NA
3 Female Non-Hispanic White NA NA
4 Female Other Hispanic NA NA
5 Female Other Race - Including Mult… NA NA
6 Male Mexican American NA NA
7 Male Non-Hispanic Black NA NA
8 Male Non-Hispanic White NA NA
9 Male Other Hispanic NA NA
10 Male Other Race - Including Mult… NA NA
Counting
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr
provides count()
. For example, if we wanted to count the number of rows of data for each age, we would do:
%>%
nhanes count(age.years)
# A tibble: 6 × 2
age.years n
<dbl> <int>
1 13 537
2 14 576
3 15 558
4 16 590
5 17 558
6 18 527
The count()
function is shorthand for something we’ve already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, nhanes %>% count(age.years)
is equivalent to:
%>%
nhanes group_by(age.years) %>%
summarise(n = n())
# A tibble: 6 × 2
age.years n
<dbl> <int>
1 13 537
2 14 576
3 15 558
4 16 590
5 17 558
6 18 527
The previous example shows the use of count()
to count the number of rows/observations for one factor (i.e., infection
). If we wanted to count a combination of factors, such as age
and sex
, we would specify the first and the second factor as the arguments of count()
:
%>%
nhanes count(age.years, sex)
# A tibble: 12 × 3
age.years sex n
<dbl> <chr> <int>
1 13 Female 268
2 13 Male 269
3 14 Female 290
4 14 Male 286
5 15 Female 271
6 15 Male 287
7 16 Female 288
8 16 Male 302
9 17 Female 263
10 17 Male 295
11 18 Female 263
12 18 Male 264
It is sometimes useful to sort the result to facilitate the comparisons. We can use arrange()
to sort the table. For instance, we might want to arrange the table above by age:
%>%
nhanes count(age.years, sex) %>%
arrange(age.years)
# A tibble: 12 × 3
age.years sex n
<dbl> <chr> <int>
1 13 Female 268
2 13 Male 269
3 14 Female 290
4 14 Male 286
5 15 Female 271
6 15 Male 287
7 16 Female 288
8 16 Male 302
9 17 Female 263
10 17 Male 295
11 18 Female 263
12 18 Male 264
or by counts:
%>%
nhanes count(age.years, sex) %>%
arrange(n)
# A tibble: 12 × 3
age.years sex n
<dbl> <chr> <int>
1 17 Female 263
2 18 Female 263
3 18 Male 264
4 13 Female 268
5 13 Male 269
6 15 Female 271
7 14 Male 286
8 15 Male 287
9 16 Female 288
10 14 Female 290
11 17 Male 295
12 16 Male 302
To sort in descending order, we need to add the desc()
function:
%>%
nhanes count(age.years, sex) %>%
arrange(desc(n))
# A tibble: 12 × 3
age.years sex n
<dbl> <chr> <int>
1 16 Male 302
2 17 Male 295
3 14 Female 290
4 16 Female 288
5 15 Male 287
6 14 Male 286
7 15 Female 271
8 13 Male 269
9 13 Female 268
10 18 Male 264
11 17 Female 263
12 18 Female 263
Exporting data
Now that you have learned how to use dplyr
to extract information from or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV files into R, there is a write_csv()
function that generates CSV files from data frames.
Before using write_csv()
, we are going to create a new folder, data_output
, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data
folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always re-generate them.
Let’s use write_csv()
to save the nhanes data.
write_csv(nhanes, file = "nhanes_processed.csv")
The materials in this lesson have been adapted from work created by the (HBC)](http://bioinformatics.sph.harvard.edu/) and Data Carpentry (http://datacarpentry.org/), as well as materials created by Laurent Gatto, Charlotte Soneson, Jenny Drnevich, Robert Castelo, and Kevin Rue-Albert. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.