── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data
This week we are going to start learning how to get to the cleaned data we have been using for the past two weeks.
Note that this week the data is in the .RData format. The .RData format is specific to R and allows us to easily dump multiple objects to a file exactly as the are inside of our R session. This format is not recommended for research as it limits analyses to only using R and could potentially not be compatible with a future R version. However, for today it will allow us to jump right into the data cleaning process.
The nhanes_beheshti_data.RData file contains 3 separate dataframes, base_df_d, base_df_e, and base_df_f. They each contain the equivalent set of columns from the D, E, and F NHANES data files for 2005-2006, 2007-2008, and 2009-2010, respectively.
Let’s load and take a look at the data.
load("nhanes_beheshti_data.RData")
summary(base_df_d)
SEQN RIDAGEYR RIAGENDR RIDRETH1
Min. :31127 Min. : 0 Length:10348 Length:10348
1st Qu.:33714 1st Qu.: 9 Class :character Class :character
Median :36301 Median :19 Mode :character Mode :character
Mean :36301 Mean :28
3rd Qu.:38887 3rd Qu.:45
Max. :41474 Max. :85
DMDBORN INDFMPIR SDMVPSU SDMVSTRA
Length:10348 Min. :0.000 Min. :1.000 Min. :44.00
Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:47.00
Mode :character Median :1.950 Median :2.000 Median :51.00
Mean :2.369 Mean :1.504 Mean :50.86
3rd Qu.:3.708 3rd Qu.:2.000 3rd Qu.:55.00
Max. :5.000 Max. :2.000 Max. :58.00
NA's :534
WTINT2YR WTMEC2YR OHXDECAY OHXREST
Min. : 1225 Min. : 0 Length:10348 Length:10348
1st Qu.: 7247 1st Qu.: 7019 Class :character Class :character
Median : 18971 Median : 18037 Mode :character Mode :character
Mean : 28181 Mean : 28181
3rd Qu.: 39383 3rd Qu.: 40205
Max. :152162 Max. :156152
LBXGLU WTSAF2YR LBXGH BMXBMI
Min. : 45.0 Min. : 0 Min. : 3.800 Min. : 11.74
1st Qu.: 89.0 1st Qu.: 14092 1st Qu.: 5.000 1st Qu.: 19.29
Median : 96.0 Median : 50932 Median : 5.300 Median : 24.31
Mean :101.7 Mean : 72673 Mean : 5.423 Mean : 25.09
3rd Qu.:104.0 3rd Qu.:107521 3rd Qu.: 5.600 3rd Qu.: 29.51
Max. :418.0 Max. :327394 Max. :15.600 Max. :130.21
NA's :7220 NA's :6996 NA's :3855 NA's :1399
Begin.Year EndYear
Min. :2005 Min. :2006
1st Qu.:2005 1st Qu.:2006
Median :2005 Median :2006
Mean :2005 Mean :2006
3rd Qu.:2005 3rd Qu.:2006
Max. :2005 Max. :2006
summary(base_df_e)
SEQN RIDAGEYR RIAGENDR RIDRETH1
Min. :41475 Min. : 0.00 Length:10149 Length:10149
1st Qu.:44012 1st Qu.: 9.00 Class :character Class :character
Median :46549 Median :29.00 Mode :character Mode :character
Mean :46549 Mean :32.97
3rd Qu.:49086 3rd Qu.:55.00
Max. :51623 Max. :80.00
DMDBORN2 INDFMPIR SDMVPSU SDMVSTRA
Length:10149 Min. :0.000 Min. :1.000 Min. :59.00
Class :character 1st Qu.:0.960 1st Qu.:1.000 1st Qu.:62.00
Mode :character Median :1.820 Median :2.000 Median :65.00
Mean :2.284 Mean :1.511 Mean :65.51
3rd Qu.:3.570 3rd Qu.:2.000 3rd Qu.:69.00
Max. :5.000 Max. :2.000 Max. :74.00
NA's :894
WTINT2YR WTMEC2YR OHXDECAY OHXREST
Min. : 2359 Min. : 0 Length:10149 Length:10149
1st Qu.: 11087 1st Qu.: 10976 Class :character Class :character
Median : 20519 Median : 20347 Mode :character Mode :character
Mean : 29277 Mean : 29277
3rd Qu.: 35545 3rd Qu.: 35669
Max. :186296 Max. :192771
LBXGLU WTSAF2YR LBXGH BMXBMI
Min. : 38.0 Min. : 0 Min. : 2.00 Min. :12.50
1st Qu.: 94.0 1st Qu.: 26035 1st Qu.: 5.20 1st Qu.:19.97
Median :101.0 Median : 54827 Median : 5.50 Median :25.16
Mean :109.8 Mean : 74987 Mean : 5.66 Mean :25.71
3rd Qu.:111.0 3rd Qu.: 94336 3rd Qu.: 5.80 3rd Qu.:30.08
Max. :584.0 Max. :404656 Max. :15.20 Max. :73.43
NA's :7039 NA's :6834 NA's :3722 NA's :1288
Begin.Year EndYear
Min. :2007 Min. :2008
1st Qu.:2007 1st Qu.:2008
Median :2007 Median :2008
Mean :2007 Mean :2008
3rd Qu.:2007 3rd Qu.:2008
Max. :2007 Max. :2008
summary(base_df_f)
SEQN RIDAGEYR RIAGENDR RIDRETH1
Min. :51624 Min. : 0.0 Length:10537 Length:10537
1st Qu.:54258 1st Qu.:10.0 Class :character Class :character
Median :56892 Median :29.0 Mode :character Mode :character
Mean :56892 Mean :32.6
3rd Qu.:59526 3rd Qu.:53.0
Max. :62160 Max. :80.0
DMDBORN2 INDFMPIR SDMVPSU SDMVSTRA
Length:10537 Min. :0.000 Min. :1.000 Min. :75.00
Class :character 1st Qu.:0.920 1st Qu.:1.000 1st Qu.:78.00
Mode :character Median :1.690 Median :2.000 Median :81.00
Mean :2.229 Mean :1.549 Mean :81.45
3rd Qu.:3.490 3rd Qu.:2.000 3rd Qu.:85.00
Max. :5.000 Max. :3.000 Max. :89.00
NA's :996
WTINT2YR WTMEC2YR OHXDECAY OHXREST
Min. : 3280 Min. : 0 Length:10537 Length:10537
1st Qu.: 12115 1st Qu.: 12125 Class :character Class :character
Median : 20029 Median : 19952 Mode :character Mode :character
Mean : 28656 Mean : 28656
3rd Qu.: 35051 3rd Qu.: 35668
Max. :153810 Max. :158147
LBXGLU WTSAF2YR LBXGH BMXBMI
Min. : 36.0 Min. : 0 Min. : 4.000 Min. :12.58
1st Qu.: 91.0 1st Qu.: 29569 1st Qu.: 5.200 1st Qu.:20.13
Median : 98.0 Median : 49447 Median : 5.500 Median :25.32
Mean :104.8 Mean : 70436 Mean : 5.673 Mean :25.92
3rd Qu.:107.0 3rd Qu.: 90248 3rd Qu.: 5.800 3rd Qu.:30.35
Max. :375.0 Max. :337219 Max. :16.400 Max. :84.87
NA's :7151 NA's :6956 NA's :3607 NA's :1125
Begin.Year EndYear
Min. :2009 Min. :2010
1st Qu.:2009 1st Qu.:2010
Median :2009 Median :2010
Mean :2009 Mean :2010
3rd Qu.:2009 3rd Qu.:2010
Max. :2009 Max. :2010
Combining data
In the reading, we saw how to combine multiple sets of observations by appending datasets. However, before combining multiple datasets we need to make sure corresponding columns between years can be combined.
NHANES will often change how it encodes or records different variables from year to year. Our first red flag are any columns whose name has changed. Let’s take a look at the column names.
It looks like DMDBORN changed after 2006 to DMDBORN2. If we check the demographic variable definitions for the three collection cycles DEMO_D, DEMO_E, and DEMO_F we can see that the encoding has changed over time.
To continue the analysis, let’s also convert these dataframes into tibbles so we can take advantage of Tidyverse functionality.
Finally, let’s also view the metadata of the columns.
DT::datatable(metadata)
Exercise
Resolve any changes in variable coding between the datasets. Use your own judgement on the best way to resolve any issues (doing nothing might also be a solution).
# Step 1: If needed, change things to have the same sets of values# TODO your code here# Step 2: Rename the DMDBORN2 columns to match between the datasetstib_e <-rename(tib_e, DMDBORN = DMDBORN2)tib_f <-rename(tib_f, DMDBORN = DMDBORN2)
Check table 1b to see how the authors chose to handle this in Beheshti et. al. 2021. Do you think there were any other factors which went into their decision?
Combining the data
Now we can combine the data together.
# Combine all years togethertib_all <-bind_rows(tib_d, tib_e, tib_f)
Data cleaning pipeline
We can choose to leave the variables with their NHANES column names or change them to be a bit more human-readable. Either way, we also want to convert each variable to it’s appropriate type or category.
Numeric variables
Let’s start by converting numeric columns:
# Set numeric columns to be numerictib_all <-mutate(tib_all,RIDAGEYR =as.numeric(RIDAGEYR),LBXGLU =as.numeric(LBXGLU),LBXGH =as.numeric(LBXGH),BMXBMI =as.numeric(BMXBMI))
Categorical variables
Many of the NHANES values are categorical data, but right now are stored as text. We can check what values exist by converting them to factors before making the decision of how to handle them in the analysis:
RIAGENDR RIDRETH1
Female:15633 Mexican American : 7388
Male :15401 Non-Hispanic Black : 6878
Non-Hispanic White :12463
Other Hispanic : 2683
Other Race - Including Multi-R: 1622
DMDBORN
"Born in 50 US States or Washi:25732
Born Elsewhere : 588
Born in Mexico : 2511
Born in Other Non-Spanish Spea: 1065
Born in Other Spanish Speaking: 1120
Don't Know : 2
Refused : 16
It looks like one of the options for DMDBORN has a lingering quote character. This can happen due to irregularities in how NHANES data is presented or small mistakes in data processing. We use the gsub function to replace all instances of double quotes with the empty string in the column.
We need to create the variables mentioned in the paper are not directly present in NHANES. For some variables such as country of birth, we could store the categories in DMDBorn but this way we can also keep the original distribution.
# Set columns to categories as in the papertib_all <-mutate(tib_all,age.cat =cut( RIDAGEYR,breaks =c(13, 15, 18, 100),include.lowest =TRUE,labels =c("13-15", "16-18", "19+")),plasma.glucose.cat =case_when( LBXGLU <100~"<100 mg/dl", LBXGLU <126~">=100 mg/dl and <126 mg/dl", LBXGLU >=126~">=126 mg/dl",.default =NA),hba1c.cat =case_when( LBXGH <5.7~"<5.7%", LBXGH >=5.7~">=5.7% and <6.5%", LBXGH >=6.5~">= 6.5%",.default =NA),bmi.cat =case_when( # If we were doing this from scratch we might also want an underweight category BMXBMI <25~"Normal", BMXBMI <30~"Overweight", BMXBMI >=30~"Obese",.default =NA), family.PIR.cat =case_when( INDFMPIR =="PIR value greater than or equa"~">= 1", INDFMPIR =="Value greater than or equal to"~">= 1",as.numeric(INDFMPIR) >=1~">=1",as.numeric(INDFMPIR) <1~"<1",.default =NA),birthplace =case_when( DMDBORN =="Born in 50 US States or Washi"~"Within the US",is.na(DMDBORN) ~NA, DMDBORN =="Don't Know"~NA, DMDBORN =="Refused"~NA,.default ="Outside the US"),dental.caries = OHXDECAY | OHXREST)
Exercise
Add a diabetes column to the dataset based on the guidelines from Beheshti et. al:
Individuals with an HbA1C of at least 6.5 percent or a plasma-fasting glucose value of at least 126 mg/dl were considered diabetics; prediabetics were those whose HbA1C ranged from 5.7 percent to 6.4 percent and whose fasting plasma glucose remained within 100 to 125 mg/dl; the remaining study participants, who had less than 5.7 percent HbA1C and less than 100 mg/dl fasting plasma glucose, were classified as nondiabetics.
It is recommended to use case_when, but feel free to use any method you want.
*Hint: remember to account for missing data! When should diabetes be NA?
While this is the data looked at in the paper, let’s take a look at some other diabetes data from the same NHANES cycles.
load("nhanes_diabetes.RData")
This contains 4 variables from the DIQ table:
DT::datatable(metadata_diabetes)
As before, we need to combine the three data cycles together.
# This time let's combine the dataframes firstdiabetes_all <-rbind(diabetes_d, diabetes_e, diabetes_f)# And then convert to a tibblediabetes_all <-as_tibble(diabetes_all) |>mutate(DIQ010 =as.factor(DIQ010),DIQ160 =as.factor(DIQ160),DIQ050 =as.factor(DIQ050)) |>select(-c(Begin.Year, EndYear))
We now need to combine the diabetes data with the rest of the nhanes data, adding in the new columns. This is a more complicated process than simply adding new rows to a dataset, as we need to determine which rows in the diabetes data are referencing the same participant as rows in the main data. This process is called joining or merging. To merge two datasets, we need to decide what to do with columns which appear in one dataset but not the other.
We also need to determine which variable in the dataset we want to use as the key when joining, the column(s) that identifies which rows in the each dataset we want to match. In the dplyr join functions this is referred to as the by argument.
In NHANES, we have the SEQN variable which uniquely identifies each participant in each survey cycle. We can use this as the key. In this case we want to perform join where we maintain all rows from the main dataset, but drop any rows which only exist in the diabetes dataset. Therefore, let’s do a left join (with the main dataset on the left).
tib_all <-left_join(tib_all, diabetes_all, by ="SEQN")summary(tib_all)
SEQN RIDAGEYR RIAGENDR
Min. :31127 Min. : 0.00 Female:15633
1st Qu.:38885 1st Qu.: 9.00 Male :15401
Median :46644 Median :25.00
Mean :46644 Mean :31.18
3rd Qu.:54402 3rd Qu.:51.00
Max. :62160 Max. :85.00
RIDRETH1 DMDBORN
Mexican American : 7388 Born Elsewhere : 588
Non-Hispanic Black : 6878 Born in 50 US States or Washi :25732
Non-Hispanic White :12463 Born in Mexico : 2511
Other Hispanic : 2683 Born in Other Non-Spanish Spea: 1065
Other Race - Including Multi-R: 1622 Born in Other Spanish Speaking: 1120
Don't Know : 2
Refused : 16
INDFMPIR SDMVPSU SDMVSTRA WTINT2YR
Min. :0.000 Min. :1.000 Min. :44.00 Min. : 1225
1st Qu.:0.960 1st Qu.:1.000 1st Qu.:55.00 1st Qu.: 10194
Median :1.820 Median :2.000 Median :65.00 Median : 19871
Mean :2.295 Mean :1.522 Mean :66.04 Mean : 28701
3rd Qu.:3.598 3rd Qu.:2.000 3rd Qu.:78.00 3rd Qu.: 36751
Max. :5.000 Max. :3.000 Max. :89.00 Max. :186296
NA's :2424
WTMEC2YR OHXDECAY OHXREST LBXGLU
Min. : 0 Mode :logical Mode :logical Min. : 36.0
1st Qu.: 9954 FALSE:14316 FALSE:7590 1st Qu.: 91.0
Median : 19584 TRUE :4079 TRUE :10805 Median : 98.0
Mean : 28701 NA's :12639 NA's :12639 Mean :105.4
3rd Qu.: 37143 3rd Qu.:107.0
Max. :192771 Max. :584.0
NA's :21410
WTSAF2YR LBXGH BMXBMI Begin.Year
Min. : 0 Min. : 2.000 Min. : 11.74 Min. :2005
1st Qu.: 21695 1st Qu.: 5.100 1st Qu.: 19.76 1st Qu.:2005
Median : 51495 Median : 5.400 Median : 24.94 Median :2007
Mean : 72640 Mean : 5.587 Mean : 25.58 Mean :2007
3rd Qu.: 98156 3rd Qu.: 5.700 3rd Qu.: 30.01 3rd Qu.:2009
Max. :404656 Max. :16.400 Max. :130.21 Max. :2009
NA's :20786 NA's :11184 NA's :3812
EndYear age.cat plasma.glucose.cat hba1c.cat
Min. :2006 13-15: 1805 Length:31034 Length:31034
1st Qu.:2006 16-18: 1855 Class :character Class :character
Median :2008 19+ :17719 Mode :character Mode :character
Mean :2008 NA's : 9655
3rd Qu.:2010
Max. :2010
bmi.cat family.PIR.cat birthplace dental.caries
Length:31034 Length:31034 Length:31034 Mode :logical
Class :character Class :character Class :character FALSE:5820
Mode :character Mode :character Mode :character TRUE :12575
NA's :12639
diabetes DIQ010 DID040 DIQ160
Length:31034 Borderline: 343 Min. : 1.00 Don't know: 40
Class :character Don't know: 23 1st Qu.: 40.00 No :18915
Mode :character No :27193 Median : 51.00 Yes : 693
Refused : 1 Mean : 59.53 NA's :11386
Yes : 2037 3rd Qu.: 62.00
NA's : 1437 Max. :999.00
NA's :28997
DIQ050
No :29004
Refused: 1
Yes : 592
NA's : 1437
Let’s also check the number of NA’s in the new columns to make sure the data has matched up.
#Base R wayna_counts <-colSums(is.na(tib_all))na_counts <-data.frame(na_counts[order(na_counts, decreasing =TRUE)])colnames(na_counts) <-c("NA Count")DT::datatable(na_counts)
Unfortunately, it looks DID040 won’t be very useful.
Exploring Other Diabetes Variables
We’ll start by checking the consistency of the DIQ variables. Was there anyone who said they were both diabetic and prediabetic? Here we’re using the table function, which will give us a nice counts table as output. We’ll see some other ways to make tables next week.
We need to filter the summary population. Luckily, the paper provides total numbers of participants for most of the filtering steps performed. This makes it much easier for us to check our work.
The total number of individuals who participated in the NHANES from 2005 to 2010 were 31,034 study-participants. Among them, 3,660 were nonedentulous adolescents (aged 13 to 18 years old at the time of screening) who had exam survey weights assigned. From those 3,660 adolescents, data to measure the main outcome variable (dental caries experience) were available for a final sample of 3,346 adolescents, representing the population of 24,386,135 U.S. adolescents after applying the NHANES sample weights, which was described and analyzed in this study.
We can now calculate values for adolescents with dental caries experience.
Exercise
Calculate the percent of adolescents with dental caries experience.
# Recall that you can sum boolean variables to get a countsum(all_ado$dental.caries)/nrow(all_ado)
[1] 0.6007173
Exporting data
Now that you have learned how to extract information from or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival.
Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.