The NHANES data set provides a large diverse set of data to study health and other sociological and epidemiological topics on the US population. It is used very widely and the bulk of the data are publicly available. Our goal with the Epiconductor project is to enable users access to the data using a container that contains an SQL database and R together with other software to support different analyses of the data. We believe that this will increase access and provide a platform for reproducibility. This document outlines some of the ways you can interact with the NHANES data while using the container and R packages phonto and nhanesA.
The National Health and Nutrition Examination Survey (NHANES) datasets are collected from the Centers for Disease Control and Prevention in the USA, including demographics, dietary, laboratory, examination, and questionnaire data. The five publicly available data categories are:
The abbreviated forms in parentheses may be substituted for the long form in phonto and nhanesA commands. There is also limited access data, eg. genetics, that requires written justification and prior approval before users are granted access. We restrict our tools to the publicly available data.
The survey is carried out in two year cycles starting from the first cycle in 1999-2000. Within each cycle, a set of people are surveyed, however not all of the participants are surveyed across all of the components. Within each of the data categories NHANES has organized the data into Questionnaires and provides web interfaces for descriptions of the contents of each questionnaire. We do not replicate the web interface as users can browse that information using standard tools, but we have constructed an integrated SQL database representation of much of the publicly available NHANES data. Placing all of the data into a single database facilitates searching and extraction of relevant values, and this provides many advantages to the individual researcher, as well as to the research world at large.
Each NHANES participant is assigned a unique ID that is stored in the database as SEQN. This is used as the primary key and merging of data extracted from different tables should be based on this variable (SEQN). For each two-year cycle, NHANES provides a set of data, documentation and code books, organized by each of the five publicly available categories mentioned above. Users can explore the available data at a high level by searching within cycle, category, and measure. One example is the Body Measures table which provides data from the 2001-2002 examination data. The web page provides details on the measurements and how they are recorded. Here are two examples:
There are many ways to access and utilize the NHANES data. Users familiar with SQL can connect and explore the data with database client tools such as HeidiSQL and DataGrip, or other programming languages such as R or Python either through tools that support database connections, or as we do in the phonto package, through custom built wrappers.
Others may use web-based tools provided by the CDC for investigating the data and metadata. In addition there are two existing R packages called nhanesA (Endres) and RNHANES (Susman). While most of the data collected can easily be downloaded from the CDC website, accessing the data in that way can be problematic and error prone and makes sharing and reproducibility more challenging due to the lack of versioning and the potential for changes in either the data or the infrastructure over time. By capturing all components into containers we can aid reproducibility by versioning the containers and by keeping copies of each version for subsequent use.
The nhanesA package provides a set of tools to search and download the NHANES data and metadata, making the data more accessible to user. However, one drawback is that the tool must access the NHANES website every time the user calls the R function, which leads slow getting of the data and raises errors occasionally due to network issues.
In the spirit of producing more easily reproduced and shared analyses we have created a SQL database in a Docker (cite Docker) container that contains most of the currently available NHANES data. For our Dockerized container only public data is included. There are some tables what were not included due to size, or availability when we downloaded the tables to create the database.
Around that SQL database we have constructed a number of R packages
and other tools that help scientists analyze the data. Since all of the
code and data are under version control, anyone can obtain the same
version of the data and code and produce identical results. We have
worked with the author of that package to adapt the functions in that
package to work on the NHANES Docker container. But, since having the
data locally provides a number of opportunities to simplify some data
integration steps we also provide a suite of tools in the
phonto
package (cite phonto).
We can efficiently access NHANES data using phonto
and
nhanesA
together. By simply knowing the name of the data
file or table, one can very quickly and efficiently obtain basic
information, e.g. column names, dimensions, etc.
First we load up the packages we will use for this vignette.
Let’s take a body measure table as an example, (BMX_I).
nhanesColnames("BMX_I")
#> [1] "SEQN" "BMDSTATS" "BMXWT" "BMIWT" "BMXRECUM" "BMIRECUM"
#> [7] "BMXHEAD" "BMIHEAD" "BMXHT" "BMIHT" "BMXBMI" "BMDBMIC"
#> [13] "BMXLEG" "BMILEG" "BMXARML" "BMIARML" "BMXARMC" "BMIARMC"
#> [19] "BMXWAIST" "BMIWAIST" "BMXSAD1" "BMXSAD2" "BMXSAD3" "BMXSAD4"
#> [25] "BMDAVSAD" "BMDSADCM" "SEQN" "BMDSTATS" "BMIWT" "BMIRECUM"
#> [31] "BMIHEAD" "BMIHT" "BMDBMIC" "BMILEG" "BMIARML" "BMIARMC"
#> [37] "BMIWAIST" "BMDSADCM" "BMXWT" "BMXRECUM" "BMXHEAD" "BMXHT"
#> [43] "BMXBMI" "BMXLEG" "BMXARML" "BMXARMC" "BMXWAIST" "BMXSAD1"
#> [49] "BMXSAD2" "BMXSAD3" "BMXSAD4" "BMDAVSAD"
nhanesHead("BMX_I")
#> SEQN BMDSTATS BMIWT BMIRECUM BMIHEAD BMIHT BMDBMIC BMILEG
#> 1 83926 Complete data for age group <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 84498 Complete data for age group <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 84545 Complete data for age group <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 83999 Complete data for age group <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 84065 Complete data for age group <NA> <NA> <NA> <NA> <NA> <NA>
#> BMIARML BMIARMC BMIWAIST BMDSADCM BMXWT BMXRECUM BMXHEAD BMXHT BMXBMI BMXLEG
#> 1 <NA> <NA> <NA> <NA> 77.8 NA NA 170.5 26.8 39.2
#> 2 <NA> <NA> <NA> <NA> 82.1 NA NA 164.0 30.5 36.0
#> 3 <NA> <NA> <NA> <NA> 92.2 NA NA 175.3 30.0 38.0
#> 4 <NA> <NA> <NA> <NA> 73.4 NA NA 171.5 25.0 37.0
#> 5 <NA> <NA> <NA> <NA> 67.4 NA NA 161.4 25.9 36.0
#> BMXARML BMXARMC BMXWAIST BMXSAD1 BMXSAD2 BMXSAD3 BMXSAD4 BMDAVSAD
#> 1 37 31.5 93.5 22.9 23.0 NA NA 23.0
#> 2 37 33.1 99.0 23.2 23.3 NA NA 23.3
#> 3 37 34.1 107.0 24.4 24.3 NA NA 24.4
#> 4 37 32.2 89.9 20.4 20.8 NA NA 20.6
#> 5 37 30.6 89.6 18.3 18.5 NA NA 18.4
nhanesTail("BMX_I")
#> SEQN BMDSTATS BMIWT BMIRECUM BMIHEAD BMIHT BMDBMIC
#> 5 93698 No body measures exam data <NA> <NA> <NA> <NA> <NA>
#> 4 93699 Complete data for age group <NA> <NA> <NA> <NA> Overweight
#> 3 93700 Complete data for age group <NA> <NA> <NA> <NA> <NA>
#> 2 93701 Complete data for age group <NA> <NA> <NA> <NA> Normal weight
#> 1 93702 Complete data for age group <NA> <NA> <NA> <NA> <NA>
#> BMILEG BMIARML BMIARMC BMIWAIST BMDSADCM BMXWT BMXRECUM BMXHEAD BMXHT BMXBMI
#> 5 <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA
#> 4 <NA> <NA> <NA> <NA> <NA> 29.0 NA NA 126.2 18.2
#> 3 <NA> <NA> <NA> <NA> <NA> 78.2 NA NA 173.3 26.0
#> 2 <NA> <NA> <NA> <NA> <NA> 28.8 NA NA 126.0 18.1
#> 1 <NA> <NA> <NA> <NA> <NA> 58.3 NA NA 165.0 21.4
#> BMXLEG BMXARML BMXARMC BMXWAIST BMXSAD1 BMXSAD2 BMXSAD3 BMXSAD4 BMDAVSAD
#> 5 NA NA NA NA NA NA NA NA NA
#> 4 NA 26.9 20.7 62.9 NA NA NA NA NA
#> 3 40.3 37.5 30.6 98.9 21.7 21.8 NA NA 21.8
#> 2 30.5 25.6 20.8 62.7 14.3 14.0 NA NA 14.2
#> 1 38.2 33.5 26.2 72.5 16.9 16.9 NA NA 16.9
Comprehensive lists of NHANES variables are maintained for each data group. For example, the demographics variables are available at https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=Demographics. This section describes how to search the NHANES database.
The function nhanesSearchTableNames
lets users search
for tables in the database using the table name. For example, we can
search the blood pressure related tables using the following code. We
search for the string “BPX” as the parameters based on the CDC table
name conventions. This also demonstrates the naming convention that is
often (although not always) used by the CDC. There is generally a base
name (some short set of uppercase characters) and then a suffix, for all
but the first cycle, that increases lexicographically one letter per
cycle. If you click on the next button in the table below you will see
that there is one table named “BPXO_J” which is a bit different from the
others. According to the documentation during the 2017-2018 cycle a
BP methodology study was conducted to compare BP measurements obtained
by a physician using a mercury sphygmomanometer to those obtained by a
health technician using an oscillometric device.
We provide access to functions within the database for matching strings. In the example below the string “BPX[_]” string is passed to the database engine directly. This string matches only tables containing the string “BPX_”; therefore, the table named “BPXO_J” will not match. We note that the table named “BPX” also has no underscore and it also will not match. Recall that in the first survey cycle the table names have no underscore or suffix.
##FIXME: somehow we need to figure out a search strategy that puts some of the output into an HTML page, or a searchable datatable ##FIXME (TJFS): I am not sure what you mean by this ## the problem with the return values is that they are often long and fairly complex, that is not easy to go through at the R level - if we push this into a datatable and open it in the browser then they could search through that. ## FIXME: one way to do this is to just capture the datatable output and push it to html and open it in the browser -
One of the main challenges in using the NHANES data set is finding the tables that contain the data you want to analyze. In this section we outline some of the basic methods you can use to do the searching. The returned value can be put into a searchable data table that you can use to further filter, sort and search for specific variables.
The primary purpose of the nhanesSearch
function is to
facilitate the search for specific text strings within the variable
description field. By utilizing this function, users can obtain a
convenient data frame that includes the names of variables whose
descriptions meet the specified criteria. Furthermore, the function
offers additional filtering options, such as specifying a start and stop
year, enabling users to further refine their search results.
# nhanesSearch use examples
#
# Search on the word bladder, restrict to the 2001-2008 surveys,
# print out 50 characters of the variable description
bl = nhanesSearch("bladder", ystart=2001, ystop=2008, nchar=50)
dim(bl)
#> [1] 26 7
#
# Search on "urin" (will match urine, urinary, etc), from 1999-2010, return table names only
urin = nhanesSearch("urin", ignore.case=TRUE, ystop=2010, namesonly = TRUE)
length(urin)
#> [1] 248
urin[1:10]
#> [1] "AGQ_D" "AL_IGE_D" "ALB_CR_D" "ALB_CR_E" "ALB_CR_F" "ALDUST_D"
#> [7] "ALQY_F" "AQQ_E" "AQQ_F" "BAQ"
#
The nhanesSearch
function also provides an option to
exclude specific matches. In the previous search, any variable name
containing the word “during” would have been considered a match.
However, by excluding these terms in the output, the function ensures
results that more accurately reflect your desired specifications.
urinEx = nhanesSearch("urin", exclude_terms="during", ignore.case=TRUE, ystop=2010, namesonly=TRUE)
length(urinEx)
#> [1] 96
urinEx[1:10]
#> [1] "AL_IGE_D" "ALB_CR_D" "ALB_CR_E" "ALB_CR_F" "ALDUST_D" "CAFE_F"
#> [7] "CARB_D" "CARB_E" "CBQPFA_E" "CBQPFA_F"
You can also restrict the search to specific data groups (such as EXAM). Several other types of search are shown below.
#
# Restrict search to 'EXAM' and 'LAB' data groups. Explicitly list matching and exclude terms, leave ignore.case set to default value of FALSE. Search surveys from 2009 to present.
urinrest = nhanesSearch(c("urin", "Urin"), exclude_terms=c("During", "eaten during", "do during"), data_group=c('EXAM', 'LAB'), ystart=2009)
head(urinrest) # namesonly=TRUE by default, and it only returns an vector
#> Variable.Name Variable.Description Data.File.Name
#> 1 URX1NP 1-Aminonaphthalene urine (pg/mL) AA_H
#> 2 URD1NPLC 1-Aminonaphthalene urine Comment Code AA_H
#> 3 URX2NP 2-Aminonaphthalene urine (pg/mL) AA_H
#> 4 URD2NPLC 2-Aminonaphthalene urine Comment Code AA_H
#> 5 URX4BP 4-Aminobiphenyl urine (pg/mL) AA_H
#> 6 URD4BPLC 4-Aminobiphenyl urine Comment Code AA_H
#> Data.File.Description Begin.Year EndYear Component
#> 1 Aromatic Amines - Urine 2013 2014 Laboratory
#> 2 Aromatic Amines - Urine 2013 2014 Laboratory
#> 3 Aromatic Amines - Urine 2013 2014 Laboratory
#> 4 Aromatic Amines - Urine 2013 2014 Laboratory
#> 5 Aromatic Amines - Urine 2013 2014 Laboratory
#> 6 Aromatic Amines - Urine 2013 2014 Laboratory
#
# Search on "tooth" or "teeth", all years
teeth = nhanesSearch(c("tooth", "teeth"), ignore.case=TRUE)
head(teeth)
#> Variable.Name
#> 1 OHQ010
#> 2 OHQ020
#> 3 OHQ630
#> 4 OHQ640
#> 5 OHQ650
#> 6 OHQ660
#> Variable.Description
#> 1 Now I have some questions about {your/SP's} mouth and teeth. How would you describe the condition of {your/SP's} mouth and teeth
#> 2 How often {do you/does SP} limit the kinds or amounts of food {you/s/he} eat{s} because of problems with {your/his/her} teeth or
#> 3 How often during the last year (have you/ has SP) felt that life in general was less satisfying because of problems with (your/h
#> 4 How often during the last year (have you/has SP) had difficulty doing (your/his/her) usual jobs or attending school because of p
#> 5 How often during the last year (have you/has SP's) sense of taste been affected by problems with (your/his/her) teeth mouth or d
#> 6 How often during the last year (have you/has SP) avoided particular foods because of problems with (your/his/her) teeth mouth or
#> Data.File.Name Data.File.Description Begin.Year EndYear Component
#> 1 OHQ_B Oral Health 2001 2002 Questionnaire
#> 2 OHQ_B Oral Health 2001 2002 Questionnaire
#> 3 OHQ_C Oral Health 2003 2004 Questionnaire
#> 4 OHQ_C Oral Health 2003 2004 Questionnaire
#> 5 OHQ_C Oral Health 2003 2004 Questionnaire
#> 6 OHQ_C Oral Health 2003 2004 Questionnaire
#
# Search for variables where the variable description begins with "Tooth"
sttooth = nhanesSearch("^Tooth")
dim(sttooth)
#> [1] 331 7
datatable(sttooth)
Now suppose we wanted to find variables names that contain
information about (low density lipoproteins). To do that one can first
use the nhanesSearch
function to retrieve a master list of
any and all questions matching this search term in it’s
Description. One can then use the output of this search to
identify variable names of interest. The variable name can then be
searched for using the nhanesSearchVarName
function. In the
code below we find a variable names corresponding to LDLs and then find
all the tables that contain a variable with that name.
Scrolling through the datatable we find that there are a number of
variables that correspond to an actual LDL measurement, several are
named LBDLDL
and so we can now search for that
variable.
Information about each variable is stored in the NHANES code book. To
access the NHANES code book one can use the nhanesCodebook
function. The function returns a list of length 5 that provides
pertinent information about the variable. The first four elements
provide basic descriptor information, i.e. Variable
is the
variable name, Description
is the actual text of the
question (or a description of the lab value), Target
tells
you which participants were eligible to be asked and
SasLabel
is the variable label. The 5th element, the
Codebook
, is a data frame providing information about the
structure of the variable, such as which values the variable can take
on. When you want to combine variables across study years you will need
to be careful to ensure that the Codebooks
are compatible.
NHANES has changed variables, added or removed possible answers, and
done many other things as the questionnaires have evolved. It is
not safe to assume that an identical variable name will
have an identical interpretation.
cb1 = nhanesCodebook(nh_table = LDLTabs[1], colname = "LBDLDL")
cb1
#> $LBDLDL
#> $LBDLDL$`Variable Name:`
#> [1] "LBDLDL"
#>
#> $LBDLDL$`SAS Label:`
#> [1] "LDL-cholesterol (mg/dL)"
#>
#> $LBDLDL$`English Text:`
#> [1] "LDL-cholesterol (mg/dL)"
#>
#> $LBDLDL$`Target:`
#> [1] "Both males and females 3 YEARS - 150 YEARS"
#>
#> $LBDLDL$LBDLDL
#> Code.or.Value Value.Description Count Cumulative Skip.to.Item
#> 1 20 to 316 Range of Values 3643 3643 <NA>
#> 2 . Missing 821 4464 <NA>
The alert reader will have noticed the column labeled Cumulative. The
values in this column are provided by NHANES and can be used to check
whether the data extraction you carried out aligns with their reported
values. The column labeled SkipToItem will be non-missing if there was
some complex logic in how the survey was performed. In some cases a set
of questions will be skipped for a subset of the participants, depending
on their answer to the current question. For example, if the
current question was
Have you ever been told by your doctor you have Diabetes?
,
and then there say 10 follow-up questions asking about symptoms, someone
who answered No
would not want to answer those questions,
and so the interviewer will skip over them, to the next relevant
question for anyone who says No
. This makes for a good
survey experience for the participants but it also makes the data
analysis a bit messier. The analyst will have to examine all the
questions that can be skipped and assess how to deal with the values
recorded. While we are on the subject of messy data, the
Target
information can also introduce structured
missingness into your data. Some questions are only relevant to certain
age groups, and in those cases the Target
field will
indicate who is going to be asked. Everyone outside of the
Target
range will have a missing value for that
question.
The unionQuery()
function is designed to aggregate data
across the years for a fixed Questionnaire. Thus we are assuming all the
inputs have essentially the same columns and that all variables are
measuring the same concept across years. This function then aggregates
by appending rows and returns the results as a data frame. If there are
columns that are unique to one (or a few) questionnaires then these will
be filled in as NA in any questionnaires that don’t have that column.
Currently this function does not check the code books for consistency of
inputs across years; analysts will need to do that for themselves.
For an example of how to utilize this function, we use the blood pressure tables BPX, BPX_B,…BPX_J from years 1999-2000 to 2017-2018. User can aggregate some or all of the data contained in relevant tables. Note: we are showing this example for illustrative purposes and have not carefully checking the code book. We encourage you to take the time to always ensure that code book entries are compatible.
The jointQuery()
function, table list of table name and
a set of column names, it merges the researched results and returns the
results as a data frame. The data are joined using SEQN which is the
unique identifier for individuals.
The example below is much more complex than our previous examples, but it is also a more realistic query example. We are collecting data across 4 tables, DEMO, BPQ, HDL and TRIGLY. And also across data collection cycles I and J. We first construct our query string, note that it is important that every table appears twice, once with an **_I** and once with an **_J**. In addition the variables being selected must be the same for each cycle and each table.
cols = list(DEMO_I=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
DEMO_J=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
BPQ_I=c('BPQ050A','BPQ020'),BPQ_J=c('BPQ050A','BPQ020'),
HDL_I=c("LBDHDD"),HDL_J=c("LBDHDD"), TRIGLY_I=c("LBXTR","LBDLDL"),
TRIGLY_J=c("LBXTR","LBDLDL"))
data <- jointQuery(cols)
tdata = data[1:100,]
datatable(tdata)
The NHANES project provides thousands of phenotypes and exposures. Navigating these can be very challenging and we are in the process of developing tools that will aid users in navigating the data quickly and reliably. Developing tools that can better help analysts navigate data at this scale is important. We are patterning our efforts on those that were developed for the UK Biobank (Bycroft et al. 2018) and specifically the PHESANT (Millard et al. 2017) package.
We can run a PHESANT-like process to convert each column into data
types. It also provides the ratio of unique values
(r_unique
), the proportion of zeros (r_zeros
),
and the ratio of NAs (r_NAs
), which is calculated by the
number of unique values, zeros, and NAs divided by total records. The
categorical data types (ordered or unordered) are represented by
integers, and we categorize them as multilevel. For example, education
(DMDEDUC2) is labeled as Multilevel(7) which means it has 7 levels.
Information on whether or not the levels are ordered would have to be
obtain from the on-line NHANES documentation.
We can also find out which variables are categorical.
In the raw NHANES data stored in the database all categorical variables are represented as integers. In order to make use of these for analysis you will need to transform them into factors in R.
Categorical variables are presented with integers as shown below.
RIAGENDR | RIDRETH1 | DMDEDUC2 |
---|---|---|
Male | Non-Hispanic White | College graduate or above |
Male | Non-Hispanic White | High school graduate/GED or equivalent |
Male | Non-Hispanic White | High school graduate/GED or equivalent |
Female | Non-Hispanic White | College graduate or above |
Female | Non-Hispanic Black | Some college or AA degree |
Female | Mexican American | 9-11th grade (Includes 12th grade with no diploma) |
And the real factor levels for year 2003-2004 can be found in the codebook as shown below.
The data from NHANES comes as tables with integer codes for each of the levels of a factor variable. A separate file, one for each cycle and questionnaire, has the map from the codes to the text description of what the levels mean. Because the chances of errors if users have untranslated variables (e.g. treating them as integers for example) we automatically translate all of these tables. There is a way to access the raw data that is explained in a different vignette.
The code below is how you would do the translation using the
nhanesA
package. But in the NHANES docker container this is
not needed.
##now we can translate the variables that need to be translated
t1 = nhanesTranslate("DEMO_J",c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2","years"),data=data)
datatable(t1[1:100,c('SEQN','RIDAGEYR','RIAGENDR', 'RIDRETH1')] )
Currently, we are doing as the following flow chat, but both the
ordered and unordered are considered as multilevel.