Week 4: Preparing Data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data

This week we are going to start learning how to get to the cleaned data we have been using for the past two weeks.

Note that this week the data is in the .RData format. The .RData format is specific to R and allows us to easily dump multiple objects to a file exactly as the are inside of our R session. This format is not recommended for research as it limits analyses to only using R and could potentially not be compatible with a future R version. However, for today it will allow us to jump right into the data cleaning process.

The nhanes_beheshti_data.RData file contains 3 separate dataframes, base_df_d, base_df_e, and base_df_f. They each contain the equivalent set of columns from the D, E, and F NHANES data files for 2005-2006, 2007-2008, and 2009-2010, respectively.

Let’s load and take a look at the data.

load("nhanes_beheshti_data.RData")

summary(base_df_d)

      SEQN          RIDAGEYR    RIAGENDR           RIDRETH1        
 Min.   :31127   Min.   : 0   Length:10348       Length:10348      
 1st Qu.:33714   1st Qu.: 9   Class :character   Class :character  
 Median :36301   Median :19   Mode  :character   Mode  :character  
 Mean   :36301   Mean   :28                                        
 3rd Qu.:38887   3rd Qu.:45                                        
 Max.   :41474   Max.   :85                                        
                                                                   
   DMDBORN             INDFMPIR        SDMVPSU         SDMVSTRA    
 Length:10348       Min.   :0.000   Min.   :1.000   Min.   :44.00  
 Class :character   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:47.00  
 Mode  :character   Median :1.950   Median :2.000   Median :51.00  
                    Mean   :2.369   Mean   :1.504   Mean   :50.86  
                    3rd Qu.:3.708   3rd Qu.:2.000   3rd Qu.:55.00  
                    Max.   :5.000   Max.   :2.000   Max.   :58.00  
                    NA's   :534                                    
    WTINT2YR         WTMEC2YR        OHXDECAY           OHXREST         
 Min.   :  1225   Min.   :     0   Length:10348       Length:10348      
 1st Qu.:  7247   1st Qu.:  7019   Class :character   Class :character  
 Median : 18971   Median : 18037   Mode  :character   Mode  :character  
 Mean   : 28181   Mean   : 28181                                        
 3rd Qu.: 39383   3rd Qu.: 40205                                        
 Max.   :152162   Max.   :156152                                        
                                                                        
     LBXGLU         WTSAF2YR          LBXGH            BMXBMI      
 Min.   : 45.0   Min.   :     0   Min.   : 3.800   Min.   : 11.74  
 1st Qu.: 89.0   1st Qu.: 14092   1st Qu.: 5.000   1st Qu.: 19.29  
 Median : 96.0   Median : 50932   Median : 5.300   Median : 24.31  
 Mean   :101.7   Mean   : 72673   Mean   : 5.423   Mean   : 25.09  
 3rd Qu.:104.0   3rd Qu.:107521   3rd Qu.: 5.600   3rd Qu.: 29.51  
 Max.   :418.0   Max.   :327394   Max.   :15.600   Max.   :130.21  
 NA's   :7220    NA's   :6996     NA's   :3855     NA's   :1399    
   Begin.Year      EndYear    
 Min.   :2005   Min.   :2006  
 1st Qu.:2005   1st Qu.:2006  
 Median :2005   Median :2006  
 Mean   :2005   Mean   :2006  
 3rd Qu.:2005   3rd Qu.:2006  
 Max.   :2005   Max.   :2006

summary(base_df_e)

      SEQN          RIDAGEYR       RIAGENDR           RIDRETH1        
 Min.   :41475   Min.   : 0.00   Length:10149       Length:10149      
 1st Qu.:44012   1st Qu.: 9.00   Class :character   Class :character  
 Median :46549   Median :29.00   Mode  :character   Mode  :character  
 Mean   :46549   Mean   :32.97                                        
 3rd Qu.:49086   3rd Qu.:55.00                                        
 Max.   :51623   Max.   :80.00                                        
                                                                      
   DMDBORN2            INDFMPIR        SDMVPSU         SDMVSTRA    
 Length:10149       Min.   :0.000   Min.   :1.000   Min.   :59.00  
 Class :character   1st Qu.:0.960   1st Qu.:1.000   1st Qu.:62.00  
 Mode  :character   Median :1.820   Median :2.000   Median :65.00  
                    Mean   :2.284   Mean   :1.511   Mean   :65.51  
                    3rd Qu.:3.570   3rd Qu.:2.000   3rd Qu.:69.00  
                    Max.   :5.000   Max.   :2.000   Max.   :74.00  
                    NA's   :894                                    
    WTINT2YR         WTMEC2YR        OHXDECAY           OHXREST         
 Min.   :  2359   Min.   :     0   Length:10149       Length:10149      
 1st Qu.: 11087   1st Qu.: 10976   Class :character   Class :character  
 Median : 20519   Median : 20347   Mode  :character   Mode  :character  
 Mean   : 29277   Mean   : 29277                                        
 3rd Qu.: 35545   3rd Qu.: 35669                                        
 Max.   :186296   Max.   :192771                                        
                                                                        
     LBXGLU         WTSAF2YR          LBXGH           BMXBMI     
 Min.   : 38.0   Min.   :     0   Min.   : 2.00   Min.   :12.50  
 1st Qu.: 94.0   1st Qu.: 26035   1st Qu.: 5.20   1st Qu.:19.97  
 Median :101.0   Median : 54827   Median : 5.50   Median :25.16  
 Mean   :109.8   Mean   : 74987   Mean   : 5.66   Mean   :25.71  
 3rd Qu.:111.0   3rd Qu.: 94336   3rd Qu.: 5.80   3rd Qu.:30.08  
 Max.   :584.0   Max.   :404656   Max.   :15.20   Max.   :73.43  
 NA's   :7039    NA's   :6834     NA's   :3722    NA's   :1288   
   Begin.Year      EndYear    
 Min.   :2007   Min.   :2008  
 1st Qu.:2007   1st Qu.:2008  
 Median :2007   Median :2008  
 Mean   :2007   Mean   :2008  
 3rd Qu.:2007   3rd Qu.:2008  
 Max.   :2007   Max.   :2008

summary(base_df_f)

      SEQN          RIDAGEYR      RIAGENDR           RIDRETH1        
 Min.   :51624   Min.   : 0.0   Length:10537       Length:10537      
 1st Qu.:54258   1st Qu.:10.0   Class :character   Class :character  
 Median :56892   Median :29.0   Mode  :character   Mode  :character  
 Mean   :56892   Mean   :32.6                                        
 3rd Qu.:59526   3rd Qu.:53.0                                        
 Max.   :62160   Max.   :80.0                                        
                                                                     
   DMDBORN2            INDFMPIR        SDMVPSU         SDMVSTRA    
 Length:10537       Min.   :0.000   Min.   :1.000   Min.   :75.00  
 Class :character   1st Qu.:0.920   1st Qu.:1.000   1st Qu.:78.00  
 Mode  :character   Median :1.690   Median :2.000   Median :81.00  
                    Mean   :2.229   Mean   :1.549   Mean   :81.45  
                    3rd Qu.:3.490   3rd Qu.:2.000   3rd Qu.:85.00  
                    Max.   :5.000   Max.   :3.000   Max.   :89.00  
                    NA's   :996                                    
    WTINT2YR         WTMEC2YR        OHXDECAY           OHXREST         
 Min.   :  3280   Min.   :     0   Length:10537       Length:10537      
 1st Qu.: 12115   1st Qu.: 12125   Class :character   Class :character  
 Median : 20029   Median : 19952   Mode  :character   Mode  :character  
 Mean   : 28656   Mean   : 28656                                        
 3rd Qu.: 35051   3rd Qu.: 35668                                        
 Max.   :153810   Max.   :158147                                        
                                                                        
     LBXGLU         WTSAF2YR          LBXGH            BMXBMI     
 Min.   : 36.0   Min.   :     0   Min.   : 4.000   Min.   :12.58  
 1st Qu.: 91.0   1st Qu.: 29569   1st Qu.: 5.200   1st Qu.:20.13  
 Median : 98.0   Median : 49447   Median : 5.500   Median :25.32  
 Mean   :104.8   Mean   : 70436   Mean   : 5.673   Mean   :25.92  
 3rd Qu.:107.0   3rd Qu.: 90248   3rd Qu.: 5.800   3rd Qu.:30.35  
 Max.   :375.0   Max.   :337219   Max.   :16.400   Max.   :84.87  
 NA's   :7151    NA's   :6956     NA's   :3607     NA's   :1125   
   Begin.Year      EndYear    
 Min.   :2009   Min.   :2010  
 1st Qu.:2009   1st Qu.:2010  
 Median :2009   Median :2010  
 Mean   :2009   Mean   :2010  
 3rd Qu.:2009   3rd Qu.:2010  
 Max.   :2009   Max.   :2010

Combining data

In the reading, we saw how to combine multiple sets of observations by appending datasets. However, before combining multiple datasets we need to make sure corresponding columns between years can be combined.

NHANES will often change how it encodes or records different variables from year to year. Our first red flag are any columns whose name has changed. Let’s take a look at the column names.

colnames(base_df_d)

 [1] "SEQN"       "RIDAGEYR"   "RIAGENDR"   "RIDRETH1"   "DMDBORN"   
 [6] "INDFMPIR"   "SDMVPSU"    "SDMVSTRA"   "WTINT2YR"   "WTMEC2YR"  
[11] "OHXDECAY"   "OHXREST"    "LBXGLU"     "WTSAF2YR"   "LBXGH"     
[16] "BMXBMI"     "Begin.Year" "EndYear"

colnames(base_df_e)

 [1] "SEQN"       "RIDAGEYR"   "RIAGENDR"   "RIDRETH1"   "DMDBORN2"  
 [6] "INDFMPIR"   "SDMVPSU"    "SDMVSTRA"   "WTINT2YR"   "WTMEC2YR"  
[11] "OHXDECAY"   "OHXREST"    "LBXGLU"     "WTSAF2YR"   "LBXGH"     
[16] "BMXBMI"     "Begin.Year" "EndYear"

colnames(base_df_f)

 [1] "SEQN"       "RIDAGEYR"   "RIAGENDR"   "RIDRETH1"   "DMDBORN2"  
 [6] "INDFMPIR"   "SDMVPSU"    "SDMVSTRA"   "WTINT2YR"   "WTMEC2YR"  
[11] "OHXDECAY"   "OHXREST"    "LBXGLU"     "WTSAF2YR"   "LBXGH"     
[16] "BMXBMI"     "Begin.Year" "EndYear"

We can also check for equality directly:

colnames(base_df_d)[colnames(base_df_d) != colnames(base_df_e)]

[1] "DMDBORN"

colnames(base_df_e)[colnames(base_df_e) != colnames(base_df_f)]

character(0)

colnames(base_df_f)[colnames(base_df_f) != colnames(base_df_d)]

[1] "DMDBORN2"

It looks like DMDBORN changed after 2006 to DMDBORN2. If we check the demographic variable definitions for the three collection cycles DEMO_D, DEMO_E, and DEMO_F we can see that the encoding has changed over time.

To continue the analysis, let’s also convert these dataframes into tibbles so we can take advantage of Tidyverse functionality.

tib_d <- as_tibble(base_df_d)
tib_e <- as_tibble(base_df_e)
tib_f <- as_tibble(base_df_f)

Finally, let’s also view the metadata of the columns.

DT::datatable(metadata)

Exercise

Resolve any changes in variable coding between the datasets. Use your own judgement on the best way to resolve any issues (doing nothing might also be a solution).

# Step 1: If needed, change things to have the same sets of values
# TODO your code here

# Step 2: Rename the DMDBORN2 columns to match between the datasets
tib_e <- rename(tib_e, DMDBORN = DMDBORN2)
tib_f <- rename(tib_f, DMDBORN = DMDBORN2)

Check table 1b to see how the authors chose to handle this in Beheshti et. al. 2021. Do you think there were any other factors which went into their decision?

Combining the data

Now we can combine the data together.

# Combine all years together
tib_all <- bind_rows(tib_d, tib_e, tib_f)

Data cleaning pipeline

We can choose to leave the variables with their NHANES column names or change them to be a bit more human-readable. Either way, we also want to convert each variable to it’s appropriate type or category.

Numeric variables

Let’s start by converting numeric columns:

# Set numeric columns to be numeric
tib_all <- mutate(tib_all,
                   RIDAGEYR = as.numeric(RIDAGEYR),
                   LBXGLU = as.numeric(LBXGLU),
                   LBXGH = as.numeric(LBXGH),
                   BMXBMI = as.numeric(BMXBMI))

Categorical variables

Many of the NHANES values are categorical data, but right now are stored as text. We can check what values exist by converting them to factors before making the decision of how to handle them in the analysis:

tib_all |>
  mutate(across(c(RIAGENDR, RIDRETH1, DMDBORN), as.factor), .keep = "none") |>
  summary()

   RIAGENDR                               RIDRETH1    
 Female:15633   Mexican American              : 7388  
 Male  :15401   Non-Hispanic Black            : 6878  
                Non-Hispanic White            :12463  
                Other Hispanic                : 2683  
                Other Race - Including Multi-R: 1622  
                                                      
                                                      
                           DMDBORN     
 "Born in 50 US States or Washi:25732  
 Born Elsewhere                :  588  
 Born in Mexico                : 2511  
 Born in Other Non-Spanish Spea: 1065  
 Born in Other Spanish Speaking: 1120  
 Don't Know                    :    2  
 Refused                       :   16

It looks like one of the options for DMDBORN has a lingering quote character. This can happen due to irregularities in how NHANES data is presented or small mistakes in data processing. We use the gsub function to replace all instances of double quotes with the empty string in the column.

tib_all <- mutate(tib_all,
                   RIAGENDR = as.factor(RIAGENDR),
                   RIDRETH1 = as.factor(RIDRETH1),
                   DMDBORN = gsub("\"", "", DMDBORN), # Remove quotes
                   DMDBORN = as.factor(DMDBORN),
                   OHXDECAY = (OHXDECAY == "Yes"),
                   OHXREST = (OHXREST == "Yes"))

New variables

We need to create the variables mentioned in the paper are not directly present in NHANES. For some variables such as country of birth, we could store the categories in DMDBorn but this way we can also keep the original distribution.

# Set columns to categories as in the paper
tib_all <- mutate(tib_all,

                    age.cat = cut(
                      RIDAGEYR,
                      breaks = c(13, 15, 18, 100),
                      include.lowest = TRUE,
                      labels = c("13-15", "16-18", "19+")),
                   
                    plasma.glucose.cat = case_when(
                     LBXGLU < 100 ~ "<100 mg/dl",
                     LBXGLU < 126 ~ ">=100 mg/dl and <126 mg/dl", 
                     LBXGLU >= 126 ~ ">=126 mg/dl",
                     .default = NA),
                   
                   hba1c.cat = case_when(
                     LBXGH < 5.7 ~ "<5.7%",
                     LBXGH >= 5.7 ~ ">=5.7% and <6.5%",
                     LBXGH >= 6.5 ~ ">= 6.5%",
                     .default = NA),
                   
                   bmi.cat = case_when( # If we were doing this from scratch we might also want an underweight category
                     BMXBMI < 25 ~ "Normal", 
                     BMXBMI < 30 ~ "Overweight",
                     BMXBMI >= 30 ~ "Obese",
                     .default = NA), 
                   
                   family.PIR.cat = case_when(
                     INDFMPIR == "PIR value greater than or equa" ~ ">= 1",
                     INDFMPIR == "Value greater than or equal to" ~ ">= 1",
                     as.numeric(INDFMPIR) >= 1 ~ ">=1",
                     as.numeric(INDFMPIR) < 1 ~ "<1",
                     .default = NA),
                   
                   birthplace = case_when(
                     DMDBORN == "Born in 50 US States or Washi" ~ "Within the US",
                     is.na(DMDBORN) ~ NA,
                     DMDBORN == "Don't Know" ~ NA,
                     DMDBORN == "Refused" ~ NA,
                    .default = "Outside the US"),
                   dental.caries = OHXDECAY | OHXREST)

Exercise

Add a diabetes column to the dataset based on the guidelines from Beheshti et. al:

Individuals with an HbA1C of at least 6.5 percent or a plasma-fasting glucose value of at least 126 mg/dl were considered diabetics; prediabetics were those whose HbA1C ranged from 5.7 percent to 6.4 percent and whose fasting plasma glucose remained within 100 to 125 mg/dl; the remaining study participants, who had less than 5.7 percent HbA1C and less than 100 mg/dl fasting plasma glucose, were classified as nondiabetics.

It is recommended to use case_when, but feel free to use any method you want.

*Hint: remember to account for missing data! When should diabetes be NA?

# Add diabetes column
tib_all <- tib_all |> 
          mutate(diabetes = case_when(
            is.na(LBXGH) & is.na(LBXGLU) ~ NA,
           LBXGH >= 6.5 | LBXGLU >= 126 ~ "diabetic",
           LBXGH >= 5.7 | LBXGLU >= 100 ~ "prediabetic",
            .default = "nondiabetic"))

Adding other diabetes data

While this is the data looked at in the paper, let’s take a look at some other diabetes data from the same NHANES cycles.

load("nhanes_diabetes.RData")

This contains 4 variables from the DIQ table:

DT::datatable(metadata_diabetes)

As before, we need to combine the three data cycles together.

# This time let's combine the dataframes first
diabetes_all <- rbind(diabetes_d, diabetes_e, diabetes_f)

# And then convert to a tibble
diabetes_all <- as_tibble(diabetes_all) |>
  mutate(DIQ010 = as.factor(DIQ010),
         DIQ160 = as.factor(DIQ160),
         DIQ050 = as.factor(DIQ050)) |>
  select(-c(Begin.Year, EndYear))

We now need to combine the diabetes data with the rest of the nhanes data, adding in the new columns. This is a more complicated process than simply adding new rows to a dataset, as we need to determine which rows in the diabetes data are referencing the same participant as rows in the main data. This process is called joining or merging. To merge two datasets, we need to decide what to do with columns which appear in one dataset but not the other.

We also need to determine which variable in the dataset we want to use as the key when joining, the column(s) that identifies which rows in the each dataset we want to match. In the dplyr join functions this is referred to as the by argument.

In NHANES, we have the SEQN variable which uniquely identifies each participant in each survey cycle. We can use this as the key. In this case we want to perform join where we maintain all rows from the main dataset, but drop any rows which only exist in the diabetes dataset. Therefore, let’s do a left join (with the main dataset on the left).

tib_all <- left_join(tib_all, diabetes_all, by = "SEQN")
summary(tib_all)

      SEQN          RIDAGEYR       RIAGENDR    
 Min.   :31127   Min.   : 0.00   Female:15633  
 1st Qu.:38885   1st Qu.: 9.00   Male  :15401  
 Median :46644   Median :25.00                 
 Mean   :46644   Mean   :31.18                 
 3rd Qu.:54402   3rd Qu.:51.00                 
 Max.   :62160   Max.   :85.00                 
                                               
                           RIDRETH1                               DMDBORN     
 Mexican American              : 7388   Born Elsewhere                :  588  
 Non-Hispanic Black            : 6878   Born in 50 US States or Washi :25732  
 Non-Hispanic White            :12463   Born in Mexico                : 2511  
 Other Hispanic                : 2683   Born in Other Non-Spanish Spea: 1065  
 Other Race - Including Multi-R: 1622   Born in Other Spanish Speaking: 1120  
                                        Don't Know                    :    2  
                                        Refused                       :   16  
    INDFMPIR        SDMVPSU         SDMVSTRA        WTINT2YR     
 Min.   :0.000   Min.   :1.000   Min.   :44.00   Min.   :  1225  
 1st Qu.:0.960   1st Qu.:1.000   1st Qu.:55.00   1st Qu.: 10194  
 Median :1.820   Median :2.000   Median :65.00   Median : 19871  
 Mean   :2.295   Mean   :1.522   Mean   :66.04   Mean   : 28701  
 3rd Qu.:3.598   3rd Qu.:2.000   3rd Qu.:78.00   3rd Qu.: 36751  
 Max.   :5.000   Max.   :3.000   Max.   :89.00   Max.   :186296  
 NA's   :2424                                                    
    WTMEC2YR       OHXDECAY        OHXREST            LBXGLU     
 Min.   :     0   Mode :logical   Mode :logical   Min.   : 36.0  
 1st Qu.:  9954   FALSE:14316     FALSE:7590      1st Qu.: 91.0  
 Median : 19584   TRUE :4079      TRUE :10805     Median : 98.0  
 Mean   : 28701   NA's :12639     NA's :12639     Mean   :105.4  
 3rd Qu.: 37143                                   3rd Qu.:107.0  
 Max.   :192771                                   Max.   :584.0  
                                                  NA's   :21410  
    WTSAF2YR          LBXGH            BMXBMI         Begin.Year  
 Min.   :     0   Min.   : 2.000   Min.   : 11.74   Min.   :2005  
 1st Qu.: 21695   1st Qu.: 5.100   1st Qu.: 19.76   1st Qu.:2005  
 Median : 51495   Median : 5.400   Median : 24.94   Median :2007  
 Mean   : 72640   Mean   : 5.587   Mean   : 25.58   Mean   :2007  
 3rd Qu.: 98156   3rd Qu.: 5.700   3rd Qu.: 30.01   3rd Qu.:2009  
 Max.   :404656   Max.   :16.400   Max.   :130.21   Max.   :2009  
 NA's   :20786    NA's   :11184    NA's   :3812                   
    EndYear      age.cat      plasma.glucose.cat  hba1c.cat        
 Min.   :2006   13-15: 1805   Length:31034       Length:31034      
 1st Qu.:2006   16-18: 1855   Class :character   Class :character  
 Median :2008   19+  :17719   Mode  :character   Mode  :character  
 Mean   :2008   NA's : 9655                                        
 3rd Qu.:2010                                                      
 Max.   :2010                                                      
                                                                   
   bmi.cat          family.PIR.cat      birthplace        dental.caries  
 Length:31034       Length:31034       Length:31034       Mode :logical  
 Class :character   Class :character   Class :character   FALSE:5820     
 Mode  :character   Mode  :character   Mode  :character   TRUE :12575    
                                                          NA's :12639    
                                                                         
                                                                         
                                                                         
   diabetes                DIQ010          DID040              DIQ160     
 Length:31034       Borderline:  343   Min.   :  1.00   Don't know:   40  
 Class :character   Don't know:   23   1st Qu.: 40.00   No        :18915  
 Mode  :character   No        :27193   Median : 51.00   Yes       :  693  
                    Refused   :    1   Mean   : 59.53   NA's      :11386  
                    Yes       : 2037   3rd Qu.: 62.00                     
                    NA's      : 1437   Max.   :999.00                     
                                       NA's   :28997                      
     DIQ050     
 No     :29004  
 Refused:    1  
 Yes    :  592  
 NA's   : 1437

Let’s also check the number of NA’s in the new columns to make sure the data has matched up.

#Base R way
na_counts <- colSums(is.na(tib_all))
na_counts <- data.frame(na_counts[order(na_counts, decreasing = TRUE)])
colnames(na_counts) <- c("NA Count")
DT::datatable(na_counts)

Unfortunately, it looks DID040 won’t be very useful.

Exploring Other Diabetes Variables

We’ll start by checking the consistency of the DIQ variables. Was there anyone who said they were both diabetic and prediabetic? Here we’re using the table function, which will give us a nice counts table as output. We’ll see some other ways to make tables next week.

table(tib_all[,c("DIQ010", "DIQ160")], exclude = NULL)

            DIQ160
DIQ010       Don't know    No   Yes  <NA>
  Borderline          0     0     0   343
  Don't know          8     7     3     5
  No                 32 18908   690  7563
  Refused             0     0     0     1
  Yes                 0     0     0  2037
  <NA>                0     0     0  1437

Now let’s see how much the diabetes definition by Beheshti agrees with the DIQ variables.

We’ll make a variable with the same categories from the DIQ data. We’ll leave out the borderline, don’t know, and refused responses.

tib_all <- tib_all |> 
  mutate(diabetes.DIQ = case_when(
    DIQ010 == "Yes" ~ "diabetic",
    DIQ160 == "Yes" ~ "prediabetic",
    DIQ160 == "No" | DIQ010 == "No" ~ "nondiabetic"
  ))

table(tib_all[,c("diabetes.DIQ", "diabetes")])

             diabetes
diabetes.DIQ  diabetic nondiabetic prediabetic
  diabetic        1277         130         424
  nondiabetic      494       11352        5238
  prediabetic      115         233         307

Extracting the study population

We need to filter the summary population. Luckily, the paper provides total numbers of participants for most of the filtering steps performed. This makes it much easier for us to check our work.

The total number of individuals who participated in the NHANES from 2005 to 2010 were 31,034 study-participants. Among them, 3,660 were nonedentulous adolescents (aged 13 to 18 years old at the time of screening) who had exam survey weights assigned. From those 3,660 adolescents, data to measure the main outcome variable (dental caries experience) were available for a final sample of 3,346 adolescents, representing the population of 24,386,135 U.S. adolescents after applying the NHANES sample weights, which was described and analyzed in this study.

tib_all <- tib_all |>
  mutate(in.study = (RIDAGEYR >= 13) & (RIDAGEYR <= 18) & !is.na(dental.caries)) # Gets the 3346 with non-NA dental carie variable

all_ado <- tib_all |>
  filter(in.study)

We can now calculate values for adolescents with dental caries experience.

Exercise

Calculate the percent of adolescents with dental caries experience.

# Recall that you can sum boolean variables to get a count
sum(all_ado$dental.caries)/nrow(all_ado)

[1] 0.6007173

Exporting data

Now that you have learned how to extract information from or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Let’s use write_csv() to save the nhanes data.

write_csv(all_ado, file = "nhanes_processed.csv")