Accessing NHANES data locally


In its default mode of operation, functions in the nhanesA package scrape data directly from the CDC website each time they are invoked. The advantage is simplicity; users only need to install the nhanesA package without any additional setup. However, the response time is contingent upon internet speed and the size of the requested data.

Starting with version 0.8.x, nhanesA offers two alternatives: using a prebuilt SQL database and using a mirror.

Using SQL database

Functions in the nhanesA package can obtain (most) data from a suitably configured Microsoft SQL Server database instead of accessing the CDC website directly. The easiest way to obtain such a database is to use the docker image created as part of the Epiconductor project. This docker image includes versions of R and RStudio, and is configured in a way that causes nhanesA to use the database when it is run inside the docker instance.

It is also possible to configure nhanesA to use a SQL database when running outside a docker instance, provided the machine has access to the database, which could be running in a docker instance on the same machine, or on another machine in the local network. To do so, the following environment variables need to be define prior to loading the nhanesA package:

  • EPICONDUCTOR_CONTAINER_VERSION (e.g., v0.12.0)
  • EPICONDUCTOR_COLLECTION_DATE (e.g., 2023-11-21)
  • EPICONDUCTOR_DB_DRIVER (e.g., FreeTDS on Linux)
  • EPICONDUCTOR_DB_SERVER (e.g., localhost)
  • EPICONDUCTOR_DB_PORT (e.g., 1433)

The first two are for information, and need not actually match the version of the database. They indicate the date on which a snapshot of the NHANES data was collected from the CDC website, and are defined suitably when running inside the docker image. However, they must be specified explicitly when trying to connect to the database from an instance of R running outside docker.

The last three environment variables define the details of how to connect to the database. For details, see the DBI and odbc packages (the latter is the backend that allows R to communicate with a Microsoft SQL server).

Usage

Once a database is successfully configured (which is most easily done by using the docker version), the nhanesA package should ideally behave similarly whether or not a database is being used. When a database is successfully found on startup, the package sets an option called use.db to TRUE.

library(nhanesA)
nhanesOptions()
$use.db
[1] TRUE

Even in this case, it is possible to pause use of the database and revert to downloading from the CDC website by setting

nhanesOptions(use.db = FALSE, log.access = TRUE)

The log.access option, if set, causes a message to be printed every time a web resource is accessed.

With these settings, we get

bpq_b_web <- nhanes("BPQ_B")
Downloading: https://wwwn.cdc.gov/Nchs/Nhanes/2001-2002/BPQ_B.XPT

On the other hand, if we use the database, we get

nhanesOptions(use.db = TRUE)
bpq_b_db <- nhanes("BPQ_B")

The two versions have minor differences: The order of rows and columns may be different, and categorical variables may be represented either as factors of character strings. However, as long as the data has not been updated on the NHANES website since it was downloaded for inclusion in the database, the contents should be identical.

str(bpq_b_web[1:10])
'data.frame':	6634 obs. of  10 variables:
 $ SEQN   : num  9966 9967 9968 9969 9970 ...
 $ BPQ010 : Factor w/ 7 levels "Less than 6 months ago,",..: 1 1 1 1 2 2 1 1 3 1 ...
 $ BPQ020 : Factor w/ 3 levels "Yes","No","Don't know": 2 2 1 2 2 1 2 2 2 2 ...
 $ BPQ030 : Factor w/ 3 levels "Yes","No","Don't know": NA NA 1 NA NA 2 NA NA NA NA ...
 $ BPQ040A: Factor w/ 3 levels "Yes","No","Don't know": NA NA 1 NA NA 1 NA NA NA NA ...
 $ BPQ040B: Factor w/ 3 levels "Yes","No","Don't know": NA NA 2 NA NA 1 NA NA NA NA ...
 $ BPQ040C: Factor w/ 3 levels "Yes","No","Don't know": NA NA 1 NA NA 1 NA NA NA NA ...
 $ BPQ040D: Factor w/ 3 levels "Yes","No","Don't know": NA NA 2 NA NA 1 NA NA NA NA ...
 $ BPQ040E: Factor w/ 3 levels "Yes","No","Don't know": NA NA 2 NA NA 1 NA NA NA NA ...
 $ BPQ040F: Factor w/ 3 levels "Yes","No","Don't know": NA NA 2 NA NA 2 NA NA NA NA ...
str(bpq_b_db[1:10])
'data.frame':	6634 obs. of  10 variables:
 $ SEQN   : int  9975 10025 10060 10074 10077 10093 10410 10542 10592 10593 ...
 $ BPQ010 : chr  "Less than 6 months ago" "Less than 6 months ago" "Less than 6 months ago" "Less than 6 months ago" ...
 $ BPQ020 : chr  "No" "No" "No" "No" ...
 $ BPQ030 : chr  NA NA NA NA ...
 $ BPQ040A: chr  NA NA NA NA ...
 $ BPQ040B: chr  NA NA NA NA ...
 $ BPQ040C: chr  NA NA NA NA ...
 $ BPQ040D: chr  NA NA NA NA ...
 $ BPQ040E: chr  NA NA NA NA ...
 $ BPQ040F: chr  NA NA NA NA ...

Using a local mirror

A conceptually simple alternative that also avoids repetitive downloads from the CDC website is to maintain a local mirror from which the data and documentation files can be retrieved as needed.

As noted here, data and documentation URLs for a particular table are determined by the table’s name and the cycle it represents. For example, the URLs for table DEMO_C, which is from cycle 3, i.e., 2003-2004, would be

It is possible to change the “base” of the server from where nhanesA tries to download these files by setting an environment variable called NHANES_TABLE_BASE, which defaults to the value "https://wwwn.cdc.gov".

The steps needed to create such a mirror is beyond the scope of this document, but tools such as wget, or even the R function download.file() in conjunction with the list of relevant URLs obtained using nhanesManifest(), may be used to download all files locally. Note that just downloading the files is not sufficient, and they must also be made available through a HTTP server running locally.

Dynamic caching using httpuv and BiocFileCache

Both the database and local mirroring options can get outdated when CDC releases new files or updates old ones. The BiocFileCache package can cache downloaded files locally in a persistent manner, updating them automatically when the source file has been updated. The experimental cachehttp package uses the BiocFileCache package in conjunction with the httpuv package to run a local server that downloads files from the CDC website the first time they are requested, but uses the cache for subsequent requests.

To use this package, first install it using

BiocManager::install("BiocFileCache")
remotes::install_github("ccb-hms/cachehttp")

Then, run the following in a separate R session.

require(cachehttp)
add_cache("cdc", "https://wwwn.cdc.gov",
          fun = function(x) {
              x <- tolower(x)
              endsWith(x, ".htm") || endsWith(x, ".xpt")
          })
s <- start_cache(host = "0.0.0.0", port = 8080,
                 static_path = BiocFileCache::bfccache(BiocFileCache::BiocFileCache()))
## stopServer(s) # to stop the httpuv server

This session must be kept active for the server to work. It can even run on a different machine, as long as it is accessible via the specified port, and does not require the nhanesA package to work.

While the server is running, we can set (in a different R session)

Sys.setenv(NHANES_TABLE_BASE = "http://127.0.0.1:8080/cdc")

(changing host IP and port as necessary) to use this server instead of the primary CDC website to serve XPT and htm files. Although the each file is downloaded from the CDC website the first time it is requested, subsequent downloads should be faster, as indicated by the elapsed times in the following code.

nhanesOptions(use.db = FALSE, log.access = TRUE)
system.time(foo <- nhanes("DEMO"))
Downloading: http://127.0.0.1:8080/cdc/Nchs/Nhanes/1999-2000/DEMO.XPT
   user  system elapsed 
  2.237   0.112   9.991 
system.time(foo <- nhanes("DEMO"))
Downloading: http://127.0.0.1:8080/cdc/Nchs/Nhanes/1999-2000/DEMO.XPT
   user  system elapsed 
  2.327   0.109   2.954