Examples of workflows enabled by docker

Deepayan Sarkar


Introduction

One of the goals of the Epiconductor project is to make the analysis of NHANES data simpler. To this end, it provides tools that can be used to (a) download all public NHANES data and and load it into an SQL database, and (b) make snapshots of this database available locally through docker images. The primary goal in distributing the pre-built docker images is to provide a ready-to-use tool for data analysts who may not be conversant with database technologies. The docker image thus includes R and RStudio, along with a version of the nhanesA R package that is configured to obtain data from the SQL database rather than the CDC website. We expect most users of the docker image to use it through the nhanesA package. Such analyses will run even outside the docker environment as well, albeit more slowly as it will involve downloading data from the CDC website.

The potential uses of the docker image goes beyond this, however. As NHANES collects and publishes data on a continuous basis, it is natural to want to combine data across cycles. This is not straightforward because a certain variable of interest may be present in different tables in different cyles, or may not even be present in all cycles. With all data locally available, it is possible to efficiently check where variables are to be found (if at all), examine differences over NHANES cycles, and detect a range of anomalies that any analyst needs to be aware of, but would normally be difficult to find. This vignette provides some illustrative examples, and describes some relevant tools in the phonto package that could be useful to practitioners.

Combining data across cycles

We start with a straightforward illustration of the process of combining data across cycles, where we combine the demographic data table from all cycles. NHANES table names typically have a suffix; the suffixes _A, _B, _C,and so on generally correspond to NHANES cycle years from 1999-2000, 2001-2002, 2003-2004, etc. However, it is important to highlight that not every table strictly adheres to this naming convention. For instance, while DEMO_B and DEMO_C are associated with the 2001-2002 and 2003-2004 cycles, respectively, the corresponding table for the 1999-2000 cycle is named ‘DEMO’, without the _A suffix. While this pattern holds for most tables, certain tables such as SSAFB_A and SSANA_A from the 1999-2000 cycle do include the _A suffix. To assist users in navigating these variations, the nhanesA package includes the nhanesSearchTableNames() function, which allows users to easily locate all table names containing a specific string, thus simplifying the process of identifying relevant table names.

library(nhanesA)
demo_all <- nhanesSearchTableNames("DEMO")
demo_all
 [1] "DEMO"   "DEMO_B" "DEMO_C" "DEMO_D" "DEMO_E" "DEMO_F" "DEMO_G" "DEMO_H"
 [9] "DEMO_I" "DEMO_J" "P_DEMO"

The last table in this list merits special mention. During the 2019-2020 cycle, data collection was disrupted by the COVID-19 pandemic. Therefore, the partial 2019-2020 data (herein 2019-March 2020 data) were combined with data from the previous cycle (2017-2018) to create a nationally representative sample covering 2017-March 2020. These data files have the same basic file name, e.g., DEMO, but add the prefix P_. These ‘pre-pandemic’ files require special handling and the CDC has provided substantial guidance as well as updated survey weights.

We can now collect all these datasets using nhanes() function. Normally, this process is likely to be somewhat slow as several files will need to be downloaded from the CDC website. Within the docker environment, however, it should be almost instantaneous.

all_demo_data <- sapply(demo_all, nhanes, simplify = FALSE)
object.size(all_demo_data) # ~45 MB
47137304 bytes
sapply(all_demo_data, dim)
     DEMO DEMO_B DEMO_C DEMO_D DEMO_E DEMO_F DEMO_G DEMO_H DEMO_I DEMO_J P_DEMO
[1,] 9965  11039  10122  10348  10149  10537   9756  10175   9971   9254  15560
[2,]  145     37     44     43     43     43     48     47     47     46     29

The first row in the output above gives the number of participants in each cycle, and the second row denotes the number of variables in the corresponding DEMO table. We can see that each cycle has around 10,000 participants, who are unique across cycles. Note, however, that the larger number of participants in the P_DEMO dataset is misleading, because many of these participants are actually from the previous cycle as described above. We will drop this table before combining the remaining datasets.

The differing number of variables across cycles indicate that variables are not measured consistently across cycles. In fact, many variables included in the DEMO table in the first cycle were subsequently included in other tables, and others have been dropped altogether or added. We can make a list of the variables that are common to all DEMO tables, and combine the corresponding data subsets together, as follows.

all_demo_data <- head(all_demo_data, -1)
common_vars <- lapply(all_demo_data, names) |> Reduce(f = intersect)
common_vars
 [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIDEXMON" "RIAGENDR" "RIDRETH1"
 [7] "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL" "RIDEXPRG"
[13] "DMDHRGND" "RIDAGEYR" "RIDAGEMN" "DMDHHSIZ" "INDFMPIR" "WTINT2YR"
[19] "WTMEC2YR" "SDMVPSU"  "SDMVSTRA"
demo_combined <-
    lapply(all_demo_data, `[`, common_vars) |>
    do.call(what = rbind) |>
    transform(cycle = substring(SDDSRVYR, 8, 17))
dim(demo_combined)
[1] 101316     22

The combined dataset can be analysed further using standard tools. For example, the following code uses the lattice package to summarize the number of participants by recorded ethnicity and gender by cycle.

library("lattice")
demo_combined |>
    xtabs(~ cycle + RIAGENDR + RIDRETH1, data = _) |>
    array2DF() |>
    dotplot(Value ~ cycle | RIAGENDR,
            groups = RIDRETH1,
            layout = c(1, 2), type = "b",
            par.settings = simpleTheme(pch = 16),
            auto.key = list(columns = 3))
plot of chunk demoplot

plot of chunk demoplot

One must be cautious when combining data across cycles, because the NHANES data are sometimes inconsistent in unexpected ways. As a simple example, consider the DMDEDUC3 variable, which records education level of children and youth. The following code illustrates that the values of this variable have inconsistent capitalization in different cycles.

xtabs(~ cycle + DMDEDUC3, demo_combined)[, 1:4]
            DMDEDUC3
cycle        10th grade 10th Grade 11th grade 11th Grade
  1999-2000           0        307          0        281
  2001-2002           0        324          0        319
  2003-2004           0        284          0        295
  2005-2006           0        307          0        277
  2007-2008           0        154          0        165
  2009-2010           0        193          0        185
  2011-2012         167          0        150          0
  2013-2014         162          0        186          0
  2015-2016         160          0        153          0
  2017-2018         139          0        155          0

Cross-cycle Consistency checks using variable codebooks

In our experience, inconsistencies such as the change in capitalization described above occur quite often, and in a variety of different ways, requiring attention to detail when combining data from across cycles. These inconsistencies are not necessarily mistakes, as NHANES questionnaires and variables are often modified from cycle to cycle. The primary source that must be consulted to identify such inconsistencies are the per-table documentation, and in particular the variable codebooks giving details of how each variable is recorded.

The NHANES database contains the variable codebooks for all tables across all cycles in a single database table called Metadata.VariableCodebook. Once imported into R, this information can be manipulated in various ways to glean information of interest. See the codebook diagnostics vignette for a different set of diagnostics that are enabled by having easy access to this information.

library("phonto")
dim(all_cb <- nhanesQuery("select * from Metadata.VariableCodebook"))
[1] 202018      7
## dim(all_cb <- metaData("Codebook")) # alt interface
all_cb <- dplyr::filter(all_cb, !startsWith(TableName, "P_")) # skip pre-pandemic tables

An analyst would typically be interested in some specific variables that are relevant to their study. The first step is to identify how many cycles these variables were recorded in. To this end, we may start by examining the number of tables each variable appears in across all cycles of continuous NHANES.

var_freq <- 
    all_cb[c("Variable", "TableName")] |> unique() |>
        xtabs(~ Variable, data = _) |>
        sort(decreasing = TRUE)
table(var_freq)
var_freq
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
4720 3861 2358  564  373  696  776  529  266  365   49   46  263   50   11   23 
  17   18   19   20   21   22   23   25   28   29   30   32   40   56   58   66 
   2   33    3   48    1    4    1    3    8    1    1    2    1    1    1    1 
  74  111 
   1    1 

Not surprisingly, many variables appear only once or twice across all cycles, and for such variables combining data across cycles would not be interesting. Variables that appear in multiple cycles, but only once per cycle, may usually be merged as in the DEMO example above, but note that they may appear in different tables. One must be careful about ensuring that the variable is measuring the same quantity in all cycles. This is usually true, but not always. For example, the range of values for the LBCBHC variable in different cycles seem to exhibit some discrepancy.

subset(all_cb, Variable == "LBCBHC")[1:5]
       Variable TableName      CodeOrValue ValueDescription Count
150088   LBCBHC  PSTPOL_D  0.004 to 2.3147  Range of Values   247
150089   LBCBHC  PSTPOL_D                .          Missing     0
150168   LBCBHC  PSTPOL_E 0.0044 to 5.2617  Range of Values   264
150169   LBCBHC  PSTPOL_E                .          Missing     0
150249   LBCBHC  PSTPOL_F    3.465 to 4801  Range of Values   295
150250   LBCBHC  PSTPOL_F                .          Missing     6
150324   LBCBHC  PSTPOL_G    3.536 to 5401  Range of Values   251
150325   LBCBHC  PSTPOL_G                .          Missing     0
150399   LBCBHC  PSTPOL_H    3.536 to 5648  Range of Values   281
150400   LBCBHC  PSTPOL_H                .          Missing     3
150467   LBCBHC  PSTPOL_I    3.536 to 1330  Range of Values   265
150468   LBCBHC  PSTPOL_I                .          Missing     0

More useful information about the variables may be obtained from the Metadata.QuestionnaireVariables table in the database, which contains one row for each variable in each table containing its description, target group, etc., obtained from the HTML documentation of NHANES tables.

dim(all_var <- phonto::nhanesQuery("select * from Metadata.QuestionnaireVariables"))
[1] 50714    11
## dim(all_var <- metaData("Variables"))
subset(all_var, Variable == "LBCBHC")[1:5]
      Variable TableName                       Description
43244   LBCBHC  PSTPOL_D Beta-hexachlorocyclohexane (ng/g)
43277   LBCBHC  PSTPOL_E Beta-hexachlorocyclohexane (ng/g)
43310   LBCBHC  PSTPOL_F Beta-hexachlorocyclohexane (pg/g)
43340   LBCBHC  PSTPOL_G Beta-hexachlorocyclohexane (pg/g)
43370   LBCBHC  PSTPOL_H Beta-hexachlorocyclohexane (pg/g)
43397   LBCBHC  PSTPOL_I Beta-hexachlorocyclohexane (pg/g)
                                           Target
43244 Both males and females 12 YEARS - 150 YEARS
43277 Both males and females 12 YEARS - 150 YEARS
43310 Both males and females 12 YEARS - 150 YEARS
43340 Both males and females 12 YEARS - 150 YEARS
43370 Both males and females 12 YEARS - 150 YEARS
43397 Both males and females 12 YEARS - 150 YEARS
                               SasLabel
43244 Beta-hexachlorocyclohexane (ng/g)
43277 Beta-hexachlorocyclohexane (ng/g)
43310 Beta-hexachlorocyclohexane (pg/g)
43340 Beta-hexachlorocyclohexane (pg/g)
43370 Beta-hexachlorocyclohexane (pg/g)
43397 Beta-hexachlorocyclohexane (pg/g)

This shows that the unit of measurement was changed from the 2009–2010 cycle, explaining the discrepancy. Without a careful check, such changes may be overlooked, leading to errors in interpretation.

It is not easy to systematically detect such changes without manual inspection of variables of interest. One way to shortlist possible candidate variables are to identify those for whom the Description or SasLabel field has changed. Unfortunately, such changes happen frequently for completely benign reasons, leading to many false positives.

Within-cycle consistency

Somewhat more surprisingly, several variables appear in more tables than there are cycles, which means that they must appear in multiple tables within the same cycle. The following variables appear in more than 20 tables.

var_freq[ var_freq > 20 ]
Variable
  URXUCR  WTSA2YR   WTDRD1   DRDINT   WTDR2D RIAGENDR    DRABF  WTSB2YR 
     111       74       66       58       56       40       32       32 
 WTSC2YR WTSAF2YR   DR1DAY DR1DRSTZ DR1EXMER  DR1LANG   DR2DAY DR2DRSTZ 
      30       29       28       28       28       28       28       28 
DR2EXMER  DR2LANG  RIANSMP RIDAGGRP    WTFSM WTSVOC2Y  DR1DBIH  DR2DBIH 
      28       28       25       25       25       23       22       22 
PHAFSTHR PHAFSTMN  DSDSUPP 
      22       22       21 

For such variables, selecting the corresponding subset of all_cb shows all entries in the codebook tables for that variable, across all tables and cycles. For example,

subset(all_cb, Variable == "PHAFSTMN")[1:5]
       Variable TableName CodeOrValue ValueDescription Count
74503  PHAFSTMN  FASTQX_D     0 to 59  Range of Values  8903
74504  PHAFSTMN  FASTQX_D           .          Missing   537
74546  PHAFSTMN  FASTQX_E     0 to 59  Range of Values  8832
74547  PHAFSTMN  FASTQX_E           .          Missing   475
74589  PHAFSTMN  FASTQX_F     0 to 59  Range of Values  9557
74590  PHAFSTMN  FASTQX_F           .          Missing   278
74632  PHAFSTMN  FASTQX_G     0 to 59  Range of Values  8528
74633  PHAFSTMN  FASTQX_G           .          Missing   428
74675  PHAFSTMN  FASTQX_H     0 to 59  Range of Values  9182
74676  PHAFSTMN  FASTQX_H           .          Missing   240
74718  PHAFSTMN  FASTQX_I     0 to 59  Range of Values  8911
74719  PHAFSTMN  FASTQX_I           .          Missing   254
74761  PHAFSTMN  FASTQX_J     0 to 59  Range of Values  7996
74762  PHAFSTMN  FASTQX_J           .          Missing   370
84545  PHAFSTMN     GLU_D     0 to 59  Range of Values  3251
84546  PHAFSTMN     GLU_D           .          Missing   101
84559  PHAFSTMN     GLU_E     0 to 59  Range of Values  3223
84560  PHAFSTMN     GLU_E           .          Missing    92
84573  PHAFSTMN     GLU_F     0 to 59  Range of Values  3535
84574  PHAFSTMN     GLU_F           .          Missing    46
84587  PHAFSTMN     GLU_G     0 to 59  Range of Values  3167
84588  PHAFSTMN     GLU_G           .          Missing    72
84598  PHAFSTMN     GLU_H     0 to 59  Range of Values  3291
84599  PHAFSTMN     GLU_H           .          Missing    38
92762  PHAFSTMN     INS_H     0 to 59  Range of Values  3291
92763  PHAFSTMN     INS_H           .          Missing    38
92776  PHAFSTMN     INS_I     0 to 59  Range of Values  3135
92777  PHAFSTMN     INS_I           .          Missing    56
110377 PHAFSTMN    OGTT_D     0 to 59  Range of Values  3251
110378 PHAFSTMN    OGTT_D           .          Missing   101
110395 PHAFSTMN    OGTT_E     0 to 59  Range of Values  3223
110396 PHAFSTMN    OGTT_E           .          Missing    92
110430 PHAFSTMN    OGTT_F     0 to 59  Range of Values  3163
110431 PHAFSTMN    OGTT_F           .          Missing     0
110465 PHAFSTMN    OGTT_G     0 to 59  Range of Values  2815
110466 PHAFSTMN    OGTT_G           .          Missing     0
110505 PHAFSTMN    OGTT_H     0 to 59  Range of Values  2909
110506 PHAFSTMN    OGTT_H           .          Missing     0
148145 PHAFSTMN        PH     0 to 59  Range of Values  8350
148146 PHAFSTMN        PH           .          Missing   482
148188 PHAFSTMN      PH_B     0 to 59  Range of Values  9480
148189 PHAFSTMN      PH_B           .          Missing   449
148231 PHAFSTMN      PH_C     0 to 59  Range of Values  9013
148232 PHAFSTMN      PH_C           .          Missing   166

Inspection of this table shows that the PHAFSTMN variable was initially recorded in the PH table for the first three cycles, after which it was recorded in three different tables (FASTQX, GLU, and OGTT) for several cycles, before being dropped again from the latter two tables. It is natural to wonder whether all these tables contain the same data. This can only be verified by comparing the actual data, which we will not do for this example, but some hints are provided by the data counts included in the codebook. For example, for the 2005–2006 cycle, we have

subset(all_cb, Variable == "PHAFSTMN" & endsWith(TableName, "_D"))[1:5]
       Variable TableName CodeOrValue ValueDescription Count
74503  PHAFSTMN  FASTQX_D     0 to 59  Range of Values  8903
74504  PHAFSTMN  FASTQX_D           .          Missing   537
84545  PHAFSTMN     GLU_D     0 to 59  Range of Values  3251
84546  PHAFSTMN     GLU_D           .          Missing   101
110377 PHAFSTMN    OGTT_D     0 to 59  Range of Values  3251
110378 PHAFSTMN    OGTT_D           .          Missing   101

From the variable metadata table, we see that

subset(all_var, Variable == "PHAFSTMN" & endsWith(TableName, "_D"))[c(1, 2, 4, 5)]
      Variable TableName                                      Target
24274 PHAFSTMN  FASTQX_D  Both males and females 1 YEARS - 150 YEARS
25529 PHAFSTMN     GLU_D Both males and females 12 YEARS - 150 YEARS
31383 PHAFSTMN    OGTT_D Both males and females 12 YEARS - 150 YEARS
                                 SasLabel
24274   Total length of food fast minutes
25529 Total length of 'food fast' minutes
31383 Total length of 'food fast' minutes

While not definitive, this suggests that the PHAFSTMN variable measures the same quantity in all three tables, and the difference in number of observations may be due to the difference in target age group.

Even if a preliminary inspection suggests no obvious problems, one should verify by comparing the actual recorded data. For example, consider the ENQ100 variable, which appears in both ENX and SPX tables.

subset(all_var, Variable == "ENQ100" & endsWith(TableName, "_E"))[c(1, 2, 4, 5)]
      Variable TableName                                    Target
23932   ENQ100     ENX_E Both males and females 6 YEARS - 79 YEARS
45401   ENQ100     SPX_E Both males and females 6 YEARS - 79 YEARS
                             SasLabel
23932 Cough cold resp illness 7 days?
45401        Had respiratory illness?
merge(nhanes("ENX_E")[c("SEQN", "ENQ100")],
      nhanes("SPX_E")[c("SEQN", "ENQ100")], by = "SEQN") |>
    xtabs(~ ENQ100.x + ENQ100.y, data =_, addNA = TRUE)
            ENQ100.y
ENQ100.x     Don't know   No  Yes <NA>
  Don't know          3    0    0    0
  No                  0 5013    0    0
  Yes                 0    0 1354    0
  <NA>                0  217   63 1093

Comparing the records in the two tables after matching by SEQN, the participant identifier, we see that even though most records are consistent, several records with Yes or No answers in the SPX tables are recorded as NA (missing) in the ENX tables.

A more egregious example, where the same variable is clearly measuring two different things, is provided by the LBXHCT variable.

subset(all_cb, Variable == "LBXHCT" & endsWith(TableName, "_H"))[1:5]
      Variable TableName   CodeOrValue ValueDescription Count
24512   LBXHCT     CBC_H  17.9 to 56.5  Range of Values  8544
24513   LBXHCT     CBC_H             .          Missing   878
33809   LBXHCT     COT_H 0.011 to 1150  Range of Values  8029
33810   LBXHCT     COT_H             .          Missing   884
subset(all_var, Variable == "LBXHCT" & endsWith(TableName, "_H"))[c(1, 2, 3, 4, 5)]
      Variable TableName                   Description
9538    LBXHCT     CBC_H                Hematocrit (%)
11531   LBXHCT     COT_H Hydroxycotinine Serum (ng/mL)
                                          Target                      SasLabel
9538  Both males and females 1 YEARS - 150 YEARS                Hematocrit (%)
11531 Both males and females 3 YEARS - 150 YEARS Hydroxycotinine Serum (ng/mL)
merge(nhanes("CBC_H")[c("SEQN", "LBXHCT")],
      nhanes("COT_H")[c("SEQN", "LBXHCT")], by = "SEQN") |> head()
   SEQN LBXHCT.x LBXHCT.y
1 73557     45.4    1.330
2 73558     36.7    4.480
3 73559     49.9    0.055
4 73560     37.8    0.025
5 73561     43.8    0.011
6 73562     41.5    0.011

Quality control reports

Getting a comprehensive picture of the issues relevant for a particular variable may take some work, as seen in the above examples. However, much of the preliminary work can be automated, and the qc_var() function in the phonto package dowa precisely this.

The report for the LBCBHC variable can be generated using the following call.

qc_var("LBCBHC")
Variable:  LBCBHC
Mismatch in Description:
                        description Frequency
1 Beta-hexachlorocyclohexane (ng/g)         2
2 Beta-hexachlorocyclohexane (pg/g)         4

Mismatch in Saslabel:
                           saslabel Frequency
1 Beta-hexachlorocyclohexane (ng/g)         2
2 Beta-hexachlorocyclohexane (pg/g)         4

Similarly, the reports for PHAFSTMN (where the issues flagged are relatively benign) and LBXHCT (where they are not) are obtained as follows.

qc_var("PHAFSTMN")
Variable:  PHAFSTMN
Appears in multiple tables within same cycle:
       cycle                            Tables
1  1999-2000                                PH
2  2001-2002                              PH_B
3  2003-2004                              PH_C
4  2005-2006         FASTQX_D / GLU_D / OGTT_D
5  2007-2008         FASTQX_E / GLU_E / OGTT_E
6  2009-2010         FASTQX_F / GLU_F / OGTT_F
7  2011-2012         FASTQX_G / GLU_G / OGTT_G
8  2013-2014 FASTQX_H / GLU_H / INS_H / OGTT_H
9  2015-2016                  FASTQX_I / INS_I
10 2017-2018                          FASTQX_J
11 2017-2020                          P_FASTQX

Mismatch in Description:
                                                                                                                     description
1 The time (in minutes) between when the examinee last ate or  drank anything other than water and the time of the venipuncture.
2  The time (in minutes) between when the examinee last ate or drank anything other than water and the time of the venipuncture.
3                                                                                            Total length of 'food fast' minutes
  Frequency
1         1
2        12
3        10

Mismatch in Saslabel:
                             saslabel Frequency
1 Total length of 'food fast' minutes        12
2   Total length of food fast minutes        11

Mismatch in Target:
                                       target Frequency
1  Both males and females 1 YEARS - 150 YEARS        11
2 Both males and females 12 YEARS - 150 YEARS        12
qc_var("LBXHCT")
Variable:  LBXHCT
Appears in multiple tables within same cycle:
       cycle        Tables
1  1999-2000         LAB25
2  2001-2002         L25_B
3  2003-2004         L25_C
4  2005-2006         CBC_D
5  2007-2008         CBC_E
6  2009-2010         CBC_F
7  2011-2012         CBC_G
8  2013-2014 CBC_H / COT_H
9  2015-2016 CBC_I / COT_I
10 2017-2018 CBC_J / COT_J
11 2017-2020         P_CBC

Mismatch in Description:
                    description Frequency
1                Hematocrit (%)        11
2 Hydroxycotinine Serum (ng/mL)         3

Mismatch in Saslabel:
                       saslabel Frequency
1                Hematocrit (%)        11
2 Hydroxycotinine Serum (ng/mL)         3

Mismatch in Target:
                                      target Frequency
1 Both males and females 1 YEARS - 150 YEARS        11
2 Both males and females 3 YEARS - 150 YEARS         3

As we can see, these reports give an incomplete but suggestive initial list of potential issues, which can be followed up on by the analyst if necessary.

Other useful checks that have not yet been implemented but are planned include:

  • Special codes in numeric variables: For numeric variables, special codes are sometimes used to indicate unusual values. A common use is to have special codes to disambiguate various types of missing responses (e.g., "Refused" and "Don't know"). Some of these are already converted to NA values by default, but others may remain.

  • One particular instance of special codes represent data coarsening, that is, clubbing of numeric values over a range (e.g., ages over 80 are usually replaced by 80).

The plan is to report the presence of such codes in variable QC reports.

Variables that are potentially skipped

Some variables in a table may be skipped depending on the response to a previous question. This information is recorded in the variable codebooks. For example, in table WHQ, if the response to WHD060 is “Yes”, then subsequent questions are skipped until WHQ090 is reached.

nhanesCodebook("WHQ", "WHD060")
$WHD060
$WHD060$`Variable Name:`
[1] "WHD060"

$WHD060$`SAS Label:`
[1] "Weight change intentional"

$WHD060$`English Text:`
[1] "Was the change between {your/SP's} current weight and {your/his/her} weight a year ago intentional?"

$WHD060$`Target:`
[1] "Both males and females 16 YEARS - 150 YEARS"

$WHD060$WHD060
  Code.or.Value Value.Description Count Cumulative Skip.to.Item
1             1               Yes   508        508       WHQ090
2             2                No   376        884         <NA>
3             7           Refused     0        884         <NA>
4             9        Don't know     0        884         <NA>
5             .           Missing  5160       6044         <NA>

This information, while useful, does not directly tell us which variables were skipped. This may be important for the analyst to know, because a variable that was skipped is recorded as a missing value, whereas we may actually have additional information based on whether it was skipped or not. In the above example, the immediately following variable is WHQ070.

nhanesCodebook("WHQ", "WHQ070")
$WHQ070
$WHQ070$`Variable Name:`
[1] "WHQ070"

$WHQ070$`SAS Label:`
[1] "Tried to lose weight in past year"

$WHQ070$`English Text:`
[1] "During the past 12 months {have you/has SP} tried to lose weight?"

$WHQ070$`Target:`
[1] "Both males and females 16 YEARS - 150 YEARS"

$WHQ070$WHQ070
  Code.or.Value Value.Description Count Cumulative Skip.to.Item
1             1               Yes  1618       1618         <NA>
2             2                No  3889       5507       WHQ090
3             7           Refused     8       5515       WHQ090
4             9        Don't know    11       5526       WHQ090
5             .           Missing   518       6044         <NA>

This codebook gives no indication of whether this question could have been skipped for some participants. However, from the previous codebook, was can infer that out of the 518 missing values, 508 were presumably from those that said “Yes” to WHD060, and these are qualitatively different from the remaining 10. While the question of how to address such situations is for the analyst to answer, it would be useful to flag variables that are potentially skipped.

The get_skip_info() function in the phonto package reports the variables that are potentially skipped in any given table. The following code applies it on the WHQ table.

get_skip_info("WHQ")
   Table Variable MaybeSkipped                SkippedDueTo
1    WHQ   WHD010        FALSE                            
2    WHQ   WHD020        FALSE                            
3    WHQ   WHQ030        FALSE                            
4    WHQ   WHD040        FALSE                            
5    WHQ   WHD050        FALSE                            
6    WHQ   WHD060        FALSE                            
7    WHQ   WHQ070         TRUE                      WHD060
8    WHQ  WHD080A         TRUE WHD060,WHQ070,WHQ070,WHQ070
9    WHQ  WHD080B         TRUE WHD060,WHQ070,WHQ070,WHQ070
10   WHQ  WHQ080C         TRUE WHD060,WHQ070,WHQ070,WHQ070
11   WHQ  WHD080D         TRUE WHD060,WHQ070,WHQ070,WHQ070
12   WHQ  WHQ080E         TRUE WHD060,WHQ070,WHQ070,WHQ070
13   WHQ  WHD080F         TRUE WHD060,WHQ070,WHQ070,WHQ070
14   WHQ  WHQ080G         TRUE WHD060,WHQ070,WHQ070,WHQ070
15   WHQ  WHQ080H         TRUE WHD060,WHQ070,WHQ070,WHQ070
16   WHQ  WHQ080I         TRUE WHD060,WHQ070,WHQ070,WHQ070
17   WHQ  WHQ080J         TRUE WHD060,WHQ070,WHQ070,WHQ070
18   WHQ  WHQ080K         TRUE WHD060,WHQ070,WHQ070,WHQ070
19   WHQ  WHD080L         TRUE WHD060,WHQ070,WHQ070,WHQ070
20   WHQ  WHD080M         TRUE WHD060,WHQ070,WHQ070,WHQ070
21   WHQ  WHD080N         TRUE WHD060,WHQ070,WHQ070,WHQ070
22   WHQ   WHQ090        FALSE                            
23   WHQ  WHQ100A         TRUE        WHQ090,WHQ090,WHQ090
24   WHQ  WHQ100B         TRUE        WHQ090,WHQ090,WHQ090
25   WHQ  WHD100C         TRUE        WHQ090,WHQ090,WHQ090
26   WHQ  WHD100D         TRUE        WHQ090,WHQ090,WHQ090
27   WHQ  WHQ100E         TRUE        WHQ090,WHQ090,WHQ090
28   WHQ  WHD100F         TRUE        WHQ090,WHQ090,WHQ090
29   WHQ  WHQ100G         TRUE        WHQ090,WHQ090,WHQ090
30   WHQ  WHQ100H         TRUE        WHQ090,WHQ090,WHQ090
31   WHQ  WHQ100I         TRUE        WHQ090,WHQ090,WHQ090
32   WHQ  WHQ100J         TRUE        WHQ090,WHQ090,WHQ090
33   WHQ  WHQ100K         TRUE        WHQ090,WHQ090,WHQ090
34   WHQ  WHD100L         TRUE        WHQ090,WHQ090,WHQ090
35   WHQ  WHD100M         TRUE        WHQ090,WHQ090,WHQ090
36   WHQ  WHD100N         TRUE        WHQ090,WHQ090,WHQ090
37   WHQ   WHD110        FALSE                            
38   WHQ   WHD120        FALSE                            
39   WHQ   WHD130        FALSE                            
40   WHQ   WHD140        FALSE                            
41   WHQ   WHD150        FALSE                            
42   WHQ   WHD160        FALSE                            
43   WHQ   WHD170        FALSE                            

Searching variables

Having access to all variable descriptions in a single dataset makes search operations convenient. For example,

all_var |> within({ Description <- tolower(Description) }) |>
    subset(endsWith(TableName, "_C") &
           (grepl("hypertension", Description) |
            grepl("blood pressure", Description)),
           select = c(1, 2, 5))
      Variable TableName                                 SasLabel
8894    BPQ010     BPQ_C    Last blood pressure reading by doctor
8895    BPQ020     BPQ_C    Ever told you had high blood pressure
8896    BPQ030     BPQ_C  Told had high blood pressure - 2+ times
8897   BPQ040A     BPQ_C     Taking prescription for hypertension
8898   BPQ040B     BPQ_C  Told to control weight for hypertension
8899   BPQ040C     BPQ_C   Told to reduce sodium for hypertension
8900   BPQ040D     BPQ_C   Told to exercise more for hypertension
8901   BPQ040E     BPQ_C  Told to reduce alcohol for hypertension
8902   BPQ040F     BPQ_C Told to do other things for hypertension
9097  PEASCST1     BPX_C                    Blood Pressure Status
9098  PEASCTM1     BPX_C           Blood Pressure Time in Seconds
9099  PEASCCT1     BPX_C                   Blood Pressure Comment
9112    BPXSY1     BPX_C     Systolic: Blood pres (1st rdg) mm Hg
9113    BPXDI1     BPX_C    Diastolic: Blood pres (1st rdg) mm Hg
9115    BPXSY2     BPX_C     Systolic: Blood pres (2nd rdg) mm Hg
9116    BPXDI2     BPX_C    Diastolic: Blood pres (2nd rdg) mm Hg
9118    BPXSY3     BPX_C     Systolic: Blood pres (3rd rdg) mm Hg
9119    BPXDI3     BPX_C    Diastolic: Blood pres (3rd rdg) mm Hg
9121    BPXSY4     BPX_C     Systolic: Blood pres (4th rdg) mm Hg
9122    BPXDI4     BPX_C    Diastolic: Blood pres (4th rdg) mm Hg
11863  CVQ220C     CVX_C             Priority 2 Stop excessive BP
11864  CVQ220E     CVX_C  Priority 2 Stop significant drop in SBP
11880   CVAARM     CVX_C  Arm selected for blood pressure monitor
11881  CVACUFF     CVX_C     Cuff size for blood pressure monitor
11888   CVDWSY     CVX_C              Warm-up systolic BP (mm Hg)
11889   CVDWDI     CVX_C             Warm-up diastolic BP (mm Hg)
11895  CVDS1SY     CVX_C              Stage 1 systolic BP (mm Hg)
11896  CVDS1DI     CVX_C             Stage 1 diastolic BP (mm Hg)
11902  CVDS2SY     CVX_C              Stage 2 systolic BP (mm Hg)
11903  CVDS2DI     CVX_C             Stage 2 diastolic BP (mm Hg)
11906  CVDR1SY     CVX_C           Recovery 1 systolic BP (mm Hg)
11907  CVDR1DI     CVX_C          Recovery 1 diastolic BP (mm Hg)
11910  CVDR2SY     CVX_C           Recovery 2 systolic BP (mm Hg)
11911  CVDR2DI     CVX_C          Recovery 2 diastolic BP (mm Hg)
11914  CVDR3SY     CVX_C           Recovery 3 systolic BP (mm Hg)
11915  CVDR3DI     CVX_C          Recovery 3 diastolic BP (mm Hg)
29638  LEXBRP1   LEXAB_C                   Brachial SBP 1 (mm Hg)
29639  LEXBRP2   LEXAB_C                   Brachial SBP 2 (mm Hg)
29640  LEXBRPM   LEXAB_C                Mean Brachial SBP (mm Hg)
29641 LEXLPTS1   LEXAB_C      Left Posterior Tibial SBP 1 (mm Hg)
29642 LEXLPTS2   LEXAB_C      Left Posterior Tibial SBP 2 (mm Hg)
29643 LEXLPTSM   LEXAB_C   Left Mean Posterior Tibial SBP (mm Hg)
29645 LEXRPTS1   LEXAB_C     Right Posterior Tibial SBP 1 (mm Hg)
29646 LEXRPTS2   LEXAB_C     Right Posterior Tibial SBP 2 (mm Hg)
29647 LEXRPTSM   LEXAB_C  Right Mean Posterior Tibial SBP (mm Hg)
29651 LEALAPNC   LEXAB_C              Left ankle SBP > 255 mm Hg?
29652 LEARAPNC   LEXAB_C             Right ankle SBP > 255 mm Hg?
30223  MCQ250F     MCQ_C    Blood relatives w/hypertension/stroke
42272  PFD069J     PFQ_C  Hypertension or high blood pressuredays

Information in this table can be supplemented using the nhanesTableSummary() function, which computes further information such as the variable type and number of non-missing observations; see this vignette for more details and links to publicly accessible online search interfaces.