Diagnostics: Codebook Inconsistencies


NHANES tables themselves have cryptic variable names, and must be used in conjunction with corresponding documentation files to be interpreted. Both standard and database versions of the nhanes() and nhanesFromURL() functions in the nhanesA package return a “translated” data frame, which modify the raw data columns in the SAS transport files using per-variable translation tables, referred to as codebooks, obtained from the NHANES online documentation.

This document describes a series of diagnostic checks to identify possible issues with these codebooks.

Variable codebooks

Variable codebooks are obtained by downloading and parsing online documentation files. These codebooks are stored in the database, making it relatively easy to work with them.

library(nhanesA)
library(phonto)
all_cb <- nhanesQuery("select * from Metadata.VariableCodebook")
str(all_cb)
'data.frame':	202018 obs. of  7 variables:
 $ Variable        : chr  "WTSA2YR" "WTSA2YR" "WTSA2YR" "URX1NP" ...
 $ TableName       : chr  "AA_H" "AA_H" "AA_H" "AA_H" ...
 $ CodeOrValue     : chr  "16284.37488 to 530325.34726" "0" "." "0.91 to 441" ...
 $ ValueDescription: chr  "Range of Values" "No Lab Result" "Missing" "Range of Values" ...
 $ Count           : int  2724 31 0 2478 277 1776 702 0 277 2488 ...
 $ Cumulative      : int  2724 2755 2755 2478 2755 1776 2478 2478 2755 2488 ...
 $ SkipToItem      : chr  NA NA NA NA ...

Ambiguous variable types

NHANES has both numeric and categorical variables. There is no indication in the data or documentation itself of what type a certain variable is supposed to be. However, for most numeric variables, the ValueDescription column will have an entry called "Range of Values". The presence of this value is used by the nhanesA package to infer the type of a variable.

Unfortunately, with this rule, some variables are flagged as numeric in some cycles but categorical in others. Such variables can be identified in the searchable variable tables available here with a Type value of ambiguous. Below, we try to take a closer look at such variables.

We first restrict our attention to variables that are ‘numeric’ in at least one table. There may be others that are mistakenly classified as numeric, but those may be difficult to flag.

numeric_vars <- with(all_cb, unique(Variable[ValueDescription == "Range of Values"]))
numeric_cb <- subset(all_cb, Variable %in% numeric_vars, select = 1:5)

Ideally, all the ‘numeric’ values in these codebooks should be identified as "Range of Values". If they are not, however, they are usually just the numeric value, or some indicator of thresholding such as "more than 80". Let us look at the ‘ValueDescription’-s that represent numeric values, in the sense that they can be coerced to a finite numeric value.

maybe_numeric <- is.finite(as.numeric(numeric_cb$ValueDescription))
Warning: NAs introduced by coercion
table(maybe_numeric)
maybe_numeric
FALSE  TRUE 
76764   480 

We will focus on these variables for now.

problem_vars <- unique(numeric_cb[maybe_numeric, ]$Variable)
str(problem_vars)
 chr [1:183] "AUXR1K2L" "AUXR8KR" "AUXR2KR" "AUXR1K2R" "AUXR3KR" "BAXFTC12" ...
length(num_cb_byVar <- numeric_cb |>
           subset(Variable %in% problem_vars) |>
           split(~ Variable))
[1] 183

Let’s start by summarizing these to keep only the unique CodeOrValue + ValueDescription combinations, and then prioritize them by the number of numeric-like values that remain.

summary_byVar <-
    lapply(num_cb_byVar,
           function(d) unique(d[c("Variable", "CodeOrValue",
                                  "ValueDescription")]))
numNumeric <- function(d) {
    suppressWarnings(sum(is.finite(as.numeric(d$ValueDescription))))
}
(nnum <- sapply(summary_byVar, numNumeric) |> sort())
AUXR1K2R  AUXR2KR  AUXR3KR  AUXR8KR BAXFTC12 CVDR3TIM  DR2LANG DRD370JQ  DUQ310Q 
       1        1        1        1        1        1        1        1        1 
 DUQ350Q   DUQ390   DXXSPY  LBDBANO  LBDEONO   LBDRPI   LBXV2P   LBXVDX   LBXVTP 
       1        1        1        1        1        1        1        1        1 
 MCQ240D  MCQ240H  MCQ240K  MCQ240M  MCQ240Q  MCQ240V OSD030CC OSD030CD  OSD110H 
       1        1        1        1        1        1        1        1        1 
 PFD069L   SSDBZP SSMTBRPS SSMTBRSG SSWT0306   SXQ267   SXQ410   SXQ550   SXQ836 
       1        1        1        1        1        1        1        1        1 
  SXQ841   URX1DC   URXMTO   URXOMO   URXP09   URXPTU   URXTCV   URXUBE WTSAF2YR 
       1        1        1        1        1        1        1        1        1 
WTSAF4YR  WTSPH01  WTSPH02  WTSPH03  WTSPH04  WTSPH05  WTSPH06  WTSPH07  WTSPH08 
       1        1        1        1        1        1        1        1        1 
 WTSPH09  WTSPH10  WTSPH11  WTSPH12  WTSPH13  WTSPH14  WTSPH15  WTSPH16  WTSPH17 
       1        1        1        1        1        1        1        1        1 
 WTSPH18  WTSPH19  WTSPH20  WTSPH21  WTSPH22  WTSPH23  WTSPH24  WTSPH25  WTSPH26 
       1        1        1        1        1        1        1        1        1 
 WTSPH27  WTSPH28  WTSPH29  WTSPH30  WTSPH31  WTSPH32  WTSPH33  WTSPH34  WTSPH35 
       1        1        1        1        1        1        1        1        1 
 WTSPH36  WTSPH37  WTSPH38  WTSPH39  WTSPH40  WTSPH41  WTSPH42  WTSPH43  WTSPH44 
       1        1        1        1        1        1        1        1        1 
 WTSPH45  WTSPH46  WTSPH47  WTSPH48  WTSPH49  WTSPH50  WTSPH51  WTSPH52  WTSPO01 
       1        1        1        1        1        1        1        1        1 
 WTSPO02  WTSPO03  WTSPO04  WTSPO05  WTSPO06  WTSPO07  WTSPO08  WTSPO09  WTSPO10 
       1        1        1        1        1        1        1        1        1 
 WTSPO11  WTSPO12  WTSPO13  WTSPO14  WTSPO15  WTSPO16  WTSPO17  WTSPO18  WTSPO19 
       1        1        1        1        1        1        1        1        1 
 WTSPO20  WTSPO21  WTSPO22  WTSPO23  WTSPO24  WTSPO25  WTSPO26  WTSPO27  WTSPO28 
       1        1        1        1        1        1        1        1        1 
 WTSPO29  WTSPO30  WTSPO31  WTSPO32  WTSPO33  WTSPO34  WTSPO35  WTSPO36  WTSPO37 
       1        1        1        1        1        1        1        1        1 
 WTSPO38  WTSPO39  WTSPO40  WTSPO41  WTSPO42  WTSPO43  WTSPO44  WTSPO45  WTSPO46 
       1        1        1        1        1        1        1        1        1 
 WTSPO47  WTSPO48  WTSPO49  WTSPO50  WTSPO51  WTSPO52 AUXR1K2L DRD370PQ   DUQ340 
       1        1        1        1        1        1        2        2        2 
  DUQ360  DUQ400Q MCQ240AA MCQ240DK  MCQ240L  MCQ240T OSD030BG  OSD110F   SMD415 
       2        2        2        2        2        2        2        2        2 
 SMD415A   URX2DC DMDHHSZA DMDHHSZE OSD030BF OSD030CE  OSQ020A  RHQ602Q DMDHHSZB 
       2        2        3        3        3        3        3        3        4 
 MCQ240Y OSD030AC DMDFMSIZ DMDHHSIZ   HUD080  MCQ240B  OSQ020C  OSQ020B  ECD070A 
       4        4        6        6        6        6        6        7       12 
  HOD050   HSQ580   KID221 
      12       12       24 

To get a sense of the problem cases, we look at the variables with 10 or more numeric variables.

num_cb_byVar[ names(which(nnum >= 10)) ]
$ECD070A
       Variable TableName CodeOrValue  ValueDescription Count
72819   ECD070A  EC24_K_R     4 to 10   Range of Values   375
72820   ECD070A  EC24_K_R           3  3 pounds or less    10
72821   ECD070A  EC24_K_R          11 11 pounds or more     0
72822   ECD070A  EC24_K_R        7777           Refused     0
72823   ECD070A  EC24_K_R        9999        Don't know     4
72824   ECD070A  EC24_K_R           .           Missing     0
72870   ECD070A       ECQ           1                 1    36
72871   ECD070A       ECQ           2                 2    19
72872   ECD070A       ECQ           3                 3    39
72873   ECD070A       ECQ           4                 4    89
72874   ECD070A       ECQ           5                 5   264
72875   ECD070A       ECQ           6                 6   869
72876   ECD070A       ECQ           7                 7  1317
72877   ECD070A       ECQ           8                 8   805
72878   ECD070A       ECQ           9                 9   237
72879   ECD070A       ECQ          10                10    87
72880   ECD070A       ECQ          11                11    15
72881   ECD070A       ECQ          12                12     1
72882   ECD070A       ECQ          13 13 pounds or more     3
72883   ECD070A       ECQ          77           Refused     2
72884   ECD070A       ECQ          99        Don't know   133
72885   ECD070A       ECQ           .           Missing     5
72956   ECD070A     ECQ_B     1 to 12   Range of Values  4257
72957   ECD070A     ECQ_B          13 13 pounds or more     6
72958   ECD070A     ECQ_B        7777           Refused     0
72959   ECD070A     ECQ_B        9999        Don't know   141
72960   ECD070A     ECQ_B           .           Missing     1
73023   ECD070A     ECQ_C     1 to 12   Range of Values  3791
73024   ECD070A     ECQ_C          13 13 pounds or more     2
73025   ECD070A     ECQ_C        7777           Refused     1
73026   ECD070A     ECQ_C        9999        Don't know   113
73027   ECD070A     ECQ_C           .           Missing     2
73089   ECD070A     ECQ_D     1 to 12   Range of Values  4071
73090   ECD070A     ECQ_D          13 13 pounds or more     4
73091   ECD070A     ECQ_D        7777           Refused     1
73092   ECD070A     ECQ_D        9999        Don't know   131
73093   ECD070A     ECQ_D           .           Missing     2
73162   ECD070A     ECQ_E     1 to 12   Range of Values  3538
73163   ECD070A     ECQ_E          13 13 pounds or more     4
73164   ECD070A     ECQ_E        7777           Refused     0
73165   ECD070A     ECQ_E        9999        Don't know    60
73166   ECD070A     ECQ_E           .           Missing     1
73213   ECD070A     ECQ_F     1 to 12   Range of Values  3578
73214   ECD070A     ECQ_F          13 13 pounds or more     8
73215   ECD070A     ECQ_F        7777           Refused     0
73216   ECD070A     ECQ_F        9999        Don't know    62
73217   ECD070A     ECQ_F           .           Missing     0
73259   ECD070A     ECQ_G     1 to 12   Range of Values  3505
73260   ECD070A     ECQ_G          13 13 pounds or more     3
73261   ECD070A     ECQ_G        7777           Refused     1
73262   ECD070A     ECQ_G        9999        Don't know    72
73263   ECD070A     ECQ_G           .           Missing     0
73305   ECD070A     ECQ_H     1 to 12   Range of Values  3622
73306   ECD070A     ECQ_H          13 13 pounds or more     0
73307   ECD070A     ECQ_H        7777           Refused     0
73308   ECD070A     ECQ_H        9999        Don't know    88
73309   ECD070A     ECQ_H           .           Missing     1
73351   ECD070A     ECQ_I     4 to 10   Range of Values  3436
73352   ECD070A     ECQ_I           3  3 pounds or less    80
73353   ECD070A     ECQ_I          11 11 pounds or more    10
73354   ECD070A     ECQ_I        7777           Refused     2
73355   ECD070A     ECQ_I        9999        Don't know   116
73356   ECD070A     ECQ_I           .           Missing     0
73398   ECD070A     ECQ_J     4 to 10   Range of Values  2926
73399   ECD070A     ECQ_J           3  3 pounds or less    66
73400   ECD070A     ECQ_J          11 11 pounds or more    15
73401   ECD070A     ECQ_J        7777           Refused     0
73402   ECD070A     ECQ_J        9999        Don't know    85
73403   ECD070A     ECQ_J           .           Missing     1
146578  ECD070A     P_ECQ     4 to 10   Range of Values  5056
146579  ECD070A     P_ECQ           3  3 pounds or less   129
146580  ECD070A     P_ECQ          11 11 pounds or more    23
146581  ECD070A     P_ECQ        7777           Refused     0
146582  ECD070A     P_ECQ        9999        Don't know   155
146583  ECD070A     P_ECQ           .           Missing     2

$HOD050
      Variable TableName CodeOrValue ValueDescription Count
86830   HOD050       HOQ     1 to 12  Range of Values  9709
86831   HOD050       HOQ          13       13 or More    70
86832   HOD050       HOQ         777          Refused    12
86833   HOD050       HOQ         999       Don't know    13
86834   HOD050       HOQ           .          Missing   161
86927   HOD050     HOQ_B     1 to 12  Range of Values 10725
86928   HOD050     HOQ_B          13       13 or More    93
86929   HOD050     HOQ_B         777          Refused    19
86930   HOD050     HOQ_B         999       Don't know    28
86931   HOD050     HOQ_B           .          Missing   174
87024   HOD050     HOQ_C     1 to 12  Range of Values  9944
87025   HOD050     HOQ_C          13       13 or more    27
87026   HOD050     HOQ_C         777          Refused    10
87027   HOD050     HOQ_C         999       Don't know     8
87028   HOD050     HOQ_C           .          Missing   133
87120   HOD050     HOQ_D     1 to 12  Range of Values 10150
87121   HOD050     HOQ_D          13       13 or more    69
87122   HOD050     HOQ_D         777          Refused     5
87123   HOD050     HOQ_D         999       Don't know    15
87124   HOD050     HOQ_D           .          Missing   109
87195   HOD050     HOQ_E     1 to 12  Range of Values  9977
87196   HOD050     HOQ_E          13       13 or more    62
87197   HOD050     HOQ_E         777          Refused     4
87198   HOD050     HOQ_E         999       Don't know    12
87199   HOD050     HOQ_E           .          Missing    94
87234   HOD050     HOQ_F     1 to 12  Range of Values 10348
87235   HOD050     HOQ_F          13       13 or more    97
87236   HOD050     HOQ_F         777          Refused    13
87237   HOD050     HOQ_F         999       Don't know    11
87238   HOD050     HOQ_F           .          Missing    68
87264   HOD050     HOQ_G           1                1    63
87265   HOD050     HOQ_G           2                2   241
87266   HOD050     HOQ_G           3                3   924
87267   HOD050     HOQ_G           4                4  1863
87268   HOD050     HOQ_G           5                5  1972
87269   HOD050     HOQ_G           6                6  1709
87270   HOD050     HOQ_G           7                7  1117
87271   HOD050     HOQ_G           8                8   776
87272   HOD050     HOQ_G           9                9   410
87273   HOD050     HOQ_G          10               10   309
87274   HOD050     HOQ_G          11               11   162
87275   HOD050     HOQ_G          12               12    67
87276   HOD050     HOQ_G          13       13 or more    90
87277   HOD050     HOQ_G         777          Refused     4
87278   HOD050     HOQ_G         999       Don't know     0
87279   HOD050     HOQ_G           .          Missing    49
87286   HOD050     HOQ_H           1                1    88
87287   HOD050     HOQ_H           2                2   202
87288   HOD050     HOQ_H           3                3   683
87289   HOD050     HOQ_H           4                4  1613
87290   HOD050     HOQ_H           5                5  2093
87291   HOD050     HOQ_H           6                6  1922
87292   HOD050     HOQ_H           7                7  1272
87293   HOD050     HOQ_H           8                8   853
87294   HOD050     HOQ_H           9                9   574
87295   HOD050     HOQ_H          10               10   343
87296   HOD050     HOQ_H          11               11   197
87297   HOD050     HOQ_H          12               12   108
87298   HOD050     HOQ_H          13       13 or more    85
87299   HOD050     HOQ_H         777          Refused    16
87300   HOD050     HOQ_H         999       Don't know     5
87301   HOD050     HOQ_H           .          Missing   121
87308   HOD050     HOQ_I           1                1    49
87309   HOD050     HOQ_I           2                2   204
87310   HOD050     HOQ_I           3                3   831
87311   HOD050     HOQ_I           4                4  1810
87312   HOD050     HOQ_I           5                5  2071
87313   HOD050     HOQ_I           6                6  1728
87314   HOD050     HOQ_I           7                7  1130
87315   HOD050     HOQ_I           8                8   773
87316   HOD050     HOQ_I           9                9   434
87317   HOD050     HOQ_I          10               10   323
87318   HOD050     HOQ_I          11               11   133
87319   HOD050     HOQ_I          12               12    71
87320   HOD050     HOQ_I          13       13 or more    51
87321   HOD050     HOQ_I         777          Refused    34
87322   HOD050     HOQ_I         999       Don't know     0
87323   HOD050     HOQ_I           .          Missing   329
87330   HOD050     HOQ_J           1                1    78
87331   HOD050     HOQ_J           2                2   198
87332   HOD050     HOQ_J           3                3   713
87333   HOD050     HOQ_J           4                4  1693
87334   HOD050     HOQ_J           5                5  1848
87335   HOD050     HOQ_J           6                6  1562
87336   HOD050     HOQ_J           7                7  1115
87337   HOD050     HOQ_J           8                8   670
87338   HOD050     HOQ_J           9                9   415
87339   HOD050     HOQ_J          10               10   276
87340   HOD050     HOQ_J          11               11    79
87341   HOD050     HOQ_J          12               12    51
87342   HOD050     HOQ_J          13       13 or more    47
87343   HOD050     HOQ_J         777          Refused     8
87344   HOD050     HOQ_J         999       Don't know    27
87345   HOD050     HOQ_J           .          Missing   474

$HSQ580
      Variable TableName CodeOrValue ValueDescription Count
90108   HSQ580       HSQ     1 to 12  Range of Values   238
90109   HSQ580       HSQ          77          Refused     0
90110   HSQ580       HSQ          99       Don't know     8
90111   HSQ580       HSQ           .          Missing  8586
90160   HSQ580     HSQ_B     1 to 12  Range of Values   272
90161   HSQ580     HSQ_B          77          Refused     0
90162   HSQ580     HSQ_B          99       Don't know     2
90163   HSQ580     HSQ_B           .          Missing 10108
90212   HSQ580     HSQ_C     1 to 12  Range of Values   233
90213   HSQ580     HSQ_C          77          Refused     0
90214   HSQ580     HSQ_C          99       Don't know     3
90215   HSQ580     HSQ_C           .          Missing  9299
90264   HSQ580     HSQ_D     1 to 12  Range of Values   235
90265   HSQ580     HSQ_D          77          Refused     0
90266   HSQ580     HSQ_D          99       Don't know     4
90267   HSQ580     HSQ_D           .          Missing  9201
90324   HSQ580     HSQ_E     1 to 12  Range of Values   242
90325   HSQ580     HSQ_E          77          Refused     0
90326   HSQ580     HSQ_E          99       Don't know     5
90327   HSQ580     HSQ_E           .          Missing  9060
90384   HSQ580     HSQ_F     1 to 12  Range of Values   338
90385   HSQ580     HSQ_F          77          Refused     0
90386   HSQ580     HSQ_F          99       Don't know     3
90387   HSQ580     HSQ_F           .          Missing  9494
90444   HSQ580     HSQ_G     1 to 12  Range of Values   261
90445   HSQ580     HSQ_G          77          Refused     0
90446   HSQ580     HSQ_G          99       Don't know     0
90447   HSQ580     HSQ_G           .          Missing  8695
90484   HSQ580     HSQ_H           1                1    39
90485   HSQ580     HSQ_H           2                2    38
90486   HSQ580     HSQ_H           3                3    28
90487   HSQ580     HSQ_H           4                4    20
90488   HSQ580     HSQ_H           5                5    13
90489   HSQ580     HSQ_H           6                6    29
90490   HSQ580     HSQ_H           7                7    19
90491   HSQ580     HSQ_H           8                8    11
90492   HSQ580     HSQ_H           9                9    11
90493   HSQ580     HSQ_H          10               10    13
90494   HSQ580     HSQ_H          11               11    17
90495   HSQ580     HSQ_H          12               12    18
90496   HSQ580     HSQ_H          77          Refused     0
90497   HSQ580     HSQ_H          99       Don't know     4
90498   HSQ580     HSQ_H           .          Missing  9162
90535   HSQ580     HSQ_I           1                1    42
90536   HSQ580     HSQ_I           2                2    34
90537   HSQ580     HSQ_I           3                3    22
90538   HSQ580     HSQ_I           4                4    29
90539   HSQ580     HSQ_I           5                5    15
90540   HSQ580     HSQ_I           6                6    29
90541   HSQ580     HSQ_I           7                7    13
90542   HSQ580     HSQ_I           8                8    12
90543   HSQ580     HSQ_I           9                9     7
90544   HSQ580     HSQ_I          10               10    10
90545   HSQ580     HSQ_I          11               11     5
90546   HSQ580     HSQ_I          12               12    10
90547   HSQ580     HSQ_I          77          Refused     0
90548   HSQ580     HSQ_I          99       Don't know     2
90549   HSQ580     HSQ_I           .          Missing  8935
90586   HSQ580     HSQ_J           1                1    53
90587   HSQ580     HSQ_J           2                2    35
90588   HSQ580     HSQ_J           3                3    23
90589   HSQ580     HSQ_J           4                4    33
90590   HSQ580     HSQ_J           5                5    11
90591   HSQ580     HSQ_J           6                6    40
90592   HSQ580     HSQ_J           7                7    15
90593   HSQ580     HSQ_J           8                8    16
90594   HSQ580     HSQ_J           9                9     7
90595   HSQ580     HSQ_J          10               10    12
90596   HSQ580     HSQ_J          11               11    18
90597   HSQ580     HSQ_J          12               12     6
90598   HSQ580     HSQ_J          77          Refused     0
90599   HSQ580     HSQ_J          99       Don't know     5
90600   HSQ580     HSQ_J           .          Missing  8092

$KID221
       Variable TableName                         CodeOrValue   ValueDescription
95207    KID221  L11PSA_C Age at diagnosis of prostate cancer Value was recorded
95208    KID221  L11PSA_C                                 777            Refused
95209    KID221  L11PSA_C                                 999         Don't know
95210    KID221  L11PSA_C                           < blank >            Missing
162578   KID221     PSA_D                                   .                  .
162579   KID221     PSA_D                                  54                 54
162580   KID221     PSA_D                                  58                 58
162581   KID221     PSA_D                                  59                 59
162582   KID221     PSA_D                                  60                 60
162583   KID221     PSA_D                                  61                 61
162584   KID221     PSA_D                                  62                 62
162585   KID221     PSA_D                                  63                 63
162586   KID221     PSA_D                                  64                 64
162587   KID221     PSA_D                                  65                 65
162588   KID221     PSA_D                                  66                 66
162589   KID221     PSA_D                                  67                 67
162590   KID221     PSA_D                                  68                 68
162591   KID221     PSA_D                                  69                 69
162592   KID221     PSA_D                                  70                 70
162593   KID221     PSA_D                                  71                 71
162594   KID221     PSA_D                                  72                 72
162595   KID221     PSA_D                                  73                 73
162596   KID221     PSA_D                                  75                 75
162597   KID221     PSA_D                                  76                 76
162598   KID221     PSA_D                                  77                 77
162599   KID221     PSA_D                                  78                 78
162600   KID221     PSA_D                                  79                 79
162601   KID221     PSA_D                                  80                 80
162602   KID221     PSA_D                                  81                 81
162603   KID221     PSA_D                       85 or greater      85 or greater
162604   KID221     PSA_D                           < blank >            Missing
162725   KID221     PSA_F                             8 to 85    Range of Values
162726   KID221     PSA_F                                 777            Refused
162727   KID221     PSA_F                                 999         Don't know
162728   KID221     PSA_F                                   .            Missing
       Count
95207     56
95208      0
95209      0
95210   1451
162578     0
162579     0
162580     0
162581     0
162582     0
162583     0
162584     0
162585     0
162586     0
162587     0
162588     0
162589     0
162590     0
162591     0
162592     0
162593     0
162594     0
162595     0
162596     0
162597     0
162598     0
162599     0
162600     0
162601     0
162602     0
162603     3
162604     0
162725    95
162726     0
162727     0
162728  1881

What to do about these?

The last example is of particular concern, because the KID221 variable clearly means different things in different tables. Otherwise, these all look like legitimate issues, and there are not many of them, so a possible workaround is to maintain an explicit list of such variables and handle them while creating the codebook. The least intrusive way would be to just insert a row with value description "Range of Values", and perhaps drop the value descriptions which can be coerced to numeric.

Codebook conversion problems

Ideally, each codebook (as returned by nhanesCodebook() should contain one element for each variable in the table, where each element is a list containing information about that variable. This information currently consists of the ‘SAS Label’, ‘English Text’, and ‘Target’, as recorded in the documentation files, along with a translation table with descriptions of the codes used in the data.

The following functions checks to see if a given codebook satisfies these expectations. In addition to checking for the presence of a translation table, it flags cases where a potentially numeric variable has unusual codes, accounting for some common non-response codes and thresholding codes.

acceptable <-
    c("Range of Values", "Missing", "No response", "Refused", "Refuse",
      "SP refused", "Could not obtain", "No Lab Result", "No lab specimen", 
      "Don't know", "Don't  Know", "Cannot be assessed",
      "Calculation cannot be determined", "Since birth",
      "Fill Value of Limit of Detection", "Below Limit of Detection",
      "None", "Never")
agelimits <-
    c("80 years or older", "85 years or older",
      ">= 80 years of age", ">= 85 years of age", "80 years of age and over",
      "9 or younger", "9 years or younger",
      "12 years or younger ", "14 years or younger",
      "45 years or older", "14 years or under",
      "60 years or older")
var_status <- function(v, cb) {
    x <- cb[[v]][[v]]
    if (is.null(x)) return(NA) # no info, usually for SEQN
    probablyNumeric <- "Range of Values" %in% x$Value.Description
    if (!probablyNumeric) return(TRUE) # OK - at least for now
    ok <- all(tolower(x$Value.Description) %in% tolower(c(acceptable, agelimits)))
    ok
}
find_conversion_problems <- function(nh_table)
{
    cb <- nhanesCodebook(nh_table)
    cb_status <- vapply(names(cb), var_status, logical(1), cb = cb)
    if (all(is.na(cb_status))) "INVALID CODEBOOK" # the whole table is problematic ?
    else lapply(cb[ !is.na(cb_status) & !cb_status ],
                function(x) x[[length(x)]][1:3])
}

These are used below to find potential problems in importing codebooks.

tables <- nhanesQuery("select TableName from Metadata.QuestionnaireDescriptions")$TableName
status <- lapply(tables, find_conversion_problems)
Error in cb[[v]][[v]]: subscript out of bounds
names(status) <- tables
Error: object 'status' not found
keep <- sapply(status, length) > 0 # tables with some issues
Error in eval(expr, envir, enclos): object 'status' not found
status <- status[keep]
Error in eval(expr, envir, enclos): object 'status' not found
tables <- tables[keep]
Error in eval(expr, envir, enclos): object 'keep' not found

Tables with no useful codebook in the database

no_codebook <- sapply(status, identical, "INVALID CODEBOOK")
Error in eval(expr, envir, enclos): object 'status' not found
cat(format(tables[no_codebook]), fill = TRUE)
Error in eval(expr, envir, enclos): object 'no_codebook' not found

ALB_CR_G is a known example where there are no translation tables; this is not a problem because all variables are numeric and do not require translation. Other instances should be investigated.

Tables with unexpected value descriptions

Most of the remaining ‘problems’ arise from special numeric codes, which are perhaps too many to deal with systematically, but do need to be accounted for during analysis. They are listed below for reference.

labels_df <- status[!no_codebook] |>
    do.call(what = c) |> do.call(what = rbind)
Error in eval(expr, envir, enclos): object 'status' not found
## keep only value and description
labels_df <- labels_df[1:2]
Error in eval(expr, envir, enclos): object 'labels_df' not found

Next, we count the number of variables each description occurs in, and sort by frequency.

labels_df <- subset(labels_df, Value.Description != "Range of Values")
Error in eval(expr, envir, enclos): object 'labels_df' not found
labels_split <- split(labels_df, ~ Value.Description)
Error in eval(expr, envir, enclos): object 'labels_df' not found
labels_summary <-
    lapply(labels_split,
           function(d) with(d,
                            data.frame(Desc = substring(as.character(Value.Description)[[1]],
                                                        1, 45),
                                       Count = length(Value.Description),
                                       Codes = sort(unique(Code.or.Value))
                                                 |> paste(collapse = "/")
                                                 |> substring(1, 30)))) |>
    do.call(what = rbind)
Error in eval(expr, envir, enclos): object 'labels_split' not found
options(width = 200)
rownames(labels_summary) <- NULL
Error: object 'labels_summary' not found
labels_summary[order(labels_summary$Count, decreasing = TRUE), ]
Error in eval(expr, envir, enclos): object 'labels_summary' not found