The goal of this vignette is to provide some details that will help the reader understand the analysis presented in ``Association of blood cobalt concentrations with dyslipidemia, hypertension, and diabetes in a US population: A cross-sectional study'' by Hongxin Wang, MD, Feng Li, MD, Jianghua Xue, MD, Yanshuang Li, MD, and Jiyu Li, MD. For the remainder of this vignette we refer to this as the “Cobalt paper”. This vignette is incomplete and does not attempt to reproduce the published results exactly. Our goal is to demonstrate how one might attempt a replication, while leaving some important details to the reader, who can extend the analysis and, if they want, make an attempt to obtain identical results.
The authors report using data for the years 2015-2016 and 2017-2018 which cover two of the two-year reporting epochs in NHANES. The Questionnaires we want will have the suffixes _I and _J.
# install EnWAS package via: devtools::install_github("ccb-hms/EnWAS")
library(splines)
library(ggplot2)
library(ggpubr)
library(dplyr)
library(nhanesA)
library(phonto)
library(EnWAS)
library(knitr)
In the next few sections we attempt to load data from NHANES in a manner that is consistent with the description provided in Section 2 of the Cobalt paper.
The authors state: Participants with cobalt and lipid data were included (n = 6866). Demographic characteristics of the participants, including age, gender, body mass index (BMI), education level, race, family poverty-income ratio and smoking status, were collected. Clinical data, such as blood pressure, total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), HDL-C, triglycerides (TGs), hypertension, diabetes and history of medication use, including antihypertensive drugs, hypoglycemic drugs, and lipid-lowering drugs, were extracted.
In the code below, we start with the variable names, which we had obtained by searching based on the variable descriptions (not shown) and restrict by the years that the authors had chosen. We will avoid looking at the LDL measurements and triglycerides as they were done on a subset of the participants, and we want to make sure we are as inclusive as possible.
##get the appropriate table names for the variables we will need
##BP
BPTabs = nhanesSearchVarName("BPQ050A", ystart="2015", ystop="2018")
LDLTabs = nhanesSearchVarName('LBDLDL',ystart="2015", ystop="2018")
##BPQ050A - currently taking meds for hypertension
##BPQ080 - told by Dr. you have high cholesterol
##BPQ100D - now taking meds for high cholesterol
##A1C
A1C = nhanesSearchVarName("LBXGH",ystart="2015", ystop="2018")
##been told by Dr. has diabetes
DrDiab = nhanesSearchVarName("DIQ010",ystart="2015", ystop="2018")
##DIQ050 - taking insulin now
##DIQ070 - taking pills for blood sugar
##HDLTabs
HDLTabs = nhanesSearchVarName("LBDHDD",ystart="2015", ystop="2018")
BMXTabs = nhanesSearchVarName("BMXBMI",ystart="2015", ystop="2018")
DIQTabs = nhanesSearchVarName("DIQ010",ystart="2015", ystop="2018")
COBTabs = nhanesSearchVarName("LBXBCO",ystart="2015", ystop="2018" )
TotChol = nhanesSearchVarName("LBXTC",ystart="2015", ystop="2018" )
cols = list(DEMO_I=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
DEMO_J=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
BPQ_I=c('BPQ050A','BPQ020','BPQ080','BPQ100D'),
BPQ_J=c('BPQ050A','BPQ020','BPQ080','BPQ100D'),
HDL_I=c("LBDHDD"),HDL_J=c("LBDHDD"),
GHB_I="LBXGH",GHB_J="LBXGH",
DIQ_I=c("DIQ010","DIQ050","DIQ070","DIQ160"),
DIQ_J=c("DIQ010","DIQ050","DIQ070","DIQ160"),
BMX_I="BMXBMI", BMX_J="BMXBMI",
TCHOL_I="LBXTC", TCHOL_J="LBXTC"
)
var2Table = cols[c(1,3,5,7,9,11,13)]
base_df <- jointQuery(cols)
dim(base_df)
#> [1] 19225 19
##FIXUP some vars
cholMeds = base_df$BPQ100D
cholMeds[base_df$BPQ080=="No"] = "No"
cholMeds[cholMeds=="Don't know"] = NA
cholMeds = factor(cholMeds)
table(cholMeds,useNA="always")
#> cholMeds
#> No Yes <NA>
#> 9225 2013 7987
base_df$cholMeds=cholMeds
##now fixup the oral meds for diabetes
##not counting insulin right now...might need it
dontskip = base_df$DIQ010 == "Yes" | base_df$DIQ010 == "Borderline" | base_df$DIQ160 == "Yes"
hypoglycemicMeds = base_df$DIQ070
hypoglycemicMeds[!dontskip] = "No"
hypoglycemicMeds = factor(hypoglycemicMeds,levels=c("Yes", "No", "Don't know","Refused"), labels=c("Yes", "No",NA,NA))
table(hypoglycemicMeds,useNA="always")
#> hypoglycemicMeds
#> Yes No <NA>
#> 1360 12454 5411
base_df$hypoglycemicMeds = hypoglycemicMeds
##Smoking
In the code chunk below we extract the smoking data and then try to create the groupings used in the paper. They have three groups: non-smokers, current smokers and ex-smokers. We use the SMQ_I and SMQ_J tables. We define a non-smoker as someone who has never smoked more than 100 cigarettes (SMQ020); anyone who has smoked more will be either a current smoker or an ex-smoker (SMQ040).
cols = list(SMQ_I=c("SMQ020","SMQ040"), SMQ_J=c("SMQ020","SMQ040"))
smokingTab= unionQuery(cols)
tdf = merge(base_df, smokingTab, all.x=TRUE)
tdf = tdf[tdf$RIDAGEYR>=40,]
##nn = nhanesTranslate("SMQ_J", colnames=c("SMQ020","SMQ040"))
##have a look at the coding for SMQ020
##nn[[1]]
##check to see what values are in the data
table(tdf$SMQ020, useNA="always")
#>
#> Don't know No Refused Yes <NA>
#> 7 4173 1 3467 0
##for SMQ040 too
table(tdf$SMQ040, useNA="always")
#>
#> Every day Not at all Some days <NA>
#> 1053 2165 249 4181
smokingVar = ifelse(tdf$SMQ020=="No", "Non-smoker",
ifelse(tdf$SMQ040=="Not at all", "Ex-smoker",
"Smoker"))
table(smokingVar, useNA="always")
#> smokingVar
#> Ex-smoker Non-smoker Smoker <NA>
#> 2165 4173 1302 8
##from the paper n=6866, Ex=1950,Non=3744,Current=1165
#Glucose
##fasting glucose
Fastgluc = nhanesSearchVarName("LBXGLU", ystart="2015", ystop="2018")
glucTab = unionQuery(list(GLU_I="LBXGLU", GLU_J="LBXGLU"))
base_df = merge(base_df, glucTab, all.x=TRUE)
The two variables that require additional lab work are LDL and cobalt levels. As a result they were measured on a much smaller set of people. So instead of loading them simultaneously we extract them below and then use the merge function.
ldlTab = unionQuery(list(TRIGLY_I=c("LBXTR","LBDLDL"),TRIGLY_J=c("LBXTR","LBDLDL")))
dim(ldlTab)
#> [1] 6227 5
cobaltTab = unionQuery(list(CRCO_I="LBXBCO", CRCO_J="LBXBCO"))
dim(cobaltTab)
#> [1] 7286 4
var2Table = c(var2Table, list("TRIGLY_I"=c("LBXTR","LBDLDL"), CRCO_I="LBXBCO"))
##we merge keeping as many records as we can, the all.x argument is important
bdf = merge(base_df, ldlTab, all.x=TRUE)
base_df = merge(bdf, cobaltTab, all.x=TRUE)
dim(base_df)
#> [1] 19225 25
If we were to remove all records with missing values at this point we would be left with a very small number of cases, so we will try to be careful not to lose too many data points.
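Before dropping rows, it can help to see where the missing values are concentrated. A minimal base-R sketch, using a small toy stand-in for `base_df` rather than the real NHANES data:

```r
## Count NAs per column and the number of complete cases.
## `df` here is a toy stand-in for base_df.
df <- data.frame(
  LBDHDD = c(55, NA, 62, 48),
  LBXGH  = c(5.4, 6.1, NA, NA),
  BMXBMI = c(27.1, 31.5, 24.9, 29.8)
)
na_per_col <- colSums(is.na(df))
n_complete <- sum(complete.cases(df))
na_per_col
#> LBDHDD  LBXGH BMXBMI
#>      1      2      0
n_complete
#> [1] 1
```

Columns with many NAs (here the toy `LBXGH`) are the ones that drive case loss under `na.omit`, which is why we merge with `all.x=TRUE` and defer filtering.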
In the next code segment we show how to obtain blood pressure measurements from NHANES tables. We have already ascertained that these measurements are contained in the tables named BPX_I and BPX_J and that replicate measurements were taken. Both systolic (BPXSY1/BPXSY2) and diastolic (BPXDI1/BPXDI2) measurements were taken on each of two occasions. We will use the average of these two measurements for individuals with two measurements, and in the case where only one measurement is available we will use it. The authors of the Cobalt paper don’t specify which values they used, so this is one of the places where our analysis may differ from theirs.
bptablenames = nhanesSearchTableNames('BPX[_]')
bptablenames |> kable()
x |
---|
BPX_B |
BPX_C |
BPX_D |
BPX_E |
BPX_F |
BPX_G |
BPX_H |
BPX_I |
BPX_J |
We can see that blood pressure data was collected for other years as well, but for now we will just extract the data for the 2015-2016 and 2017-2018 years. We combine these into a single dataframe.
blood_df <- unionQuery(list(BPX_I=c("BPXDI1","BPXDI2","BPXSY1","BPXSY2"),
BPX_J=c("BPXDI1","BPXDI2","BPXSY1","BPXSY2")))
dim(blood_df)
#> [1] 18248 7
# Average the first and second reads
# taking some care to keep one measurement if the other is missing
blood_df$DIASTOLIC <- rowMeans(blood_df[, c("BPXDI1", "BPXDI2")], na.rm=TRUE)
blood_df$DIASTOLIC[is.na(blood_df$BPXDI1) & is.na(blood_df$BPXDI2)] = NA
blood_df$SYSTOLIC <- rowMeans(blood_df[, c("BPXSY1", "BPXSY2")], na.rm=TRUE)
blood_df$SYSTOLIC[is.na(blood_df$BPXSY1) & is.na(blood_df$BPXSY2)] = NA
dim(blood_df)
#> [1] 18248 9
blood_df[1:10,] |> kable()
SEQN | BPXDI1 | BPXDI2 | BPXSY1 | BPXSY2 | Begin.Year | EndYear | DIASTOLIC | SYSTOLIC |
---|---|---|---|---|---|---|---|---|
85432 | 70 | 70 | 112 | 120 | 2015 | 2016 | 70 | 116 |
98018 | 48 | 0 | 120 | 128 | 2017 | 2018 | 24 | 124 |
90883 | 56 | 52 | 108 | 118 | 2015 | 2016 | 54 | 113 |
92397 | 60 | 58 | 116 | 114 | 2015 | 2016 | 59 | 115 |
101919 | 66 | 64 | 112 | 116 | 2017 | 2018 | 65 | 114 |
102725 | 78 | 74 | 110 | 114 | 2017 | 2018 | 76 | 112 |
102910 | 60 | 66 | 136 | 134 | 2017 | 2018 | 63 | 135 |
100360 | 74 | 70 | 104 | 104 | 2017 | 2018 | 72 | 104 |
101817 | 74 | 70 | 122 | 118 | 2017 | 2018 | 72 | 120 |
94417 | NA | NA | NA | NA | 2017 | 2018 | NA | NA |
In our analysis we can then use the average of the two reads as an estimate of the actual blood pressure for each participant.
Now you have a dataframe with the data, but you will need to better understand the variables, exposures and responses. To help with that we have created some tools, based on related work in the UK Biobank called PHESANT (cite PHESANT). The process takes each variable (column) and reports what type of data it is, continuous or multi-level (i.e., factors). We report the number of levels, and you can use the nhanesTranslate function to learn more about the levels of the factor. Some numeric quantities are reported, such as the ratio of unique values to the length of the vector, the proportion of zeros, and the proportion of missing values. NHANES has chosen to store categorical data types (ordered or unordered) as integers. For example, education level, DMDEDUC2, is identified as Multilevel-8. You can learn more about the details of the NHANES data in the Quick Start Vignette (cite quick start).
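The per-column numeric summaries described above can be approximated in a few lines of base R. The sketch below is a hypothetical illustration (not the phonto implementation) computing the ratio of unique values, the proportion of zeros, and the proportion of missing values for one vector:

```r
## Hypothetical, simplified version of the per-variable summaries
## that phesant() reports; not the phonto implementation.
summarize_var <- function(x) {
  n <- length(x)
  c(unique_ratio = length(unique(x[!is.na(x)])) / n,  # distinct non-NA values / length
    prop_zero    = sum(x == 0, na.rm = TRUE) / n,     # fraction of zeros
    prop_na      = sum(is.na(x)) / n)                 # fraction missing
}
x <- c(0, 0, 1.5, 2.3, NA, 2.3, 7.1, NA, 0, 4.4)
summarize_var(x)
#> unique_ratio    prop_zero      prop_na
#>          0.5          0.3          0.2
```

A low unique-value ratio on an integer-coded column is the usual hint that it is really a categorical variable in disguise.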
data <- merge(base_df, blood_df, all.x=TRUE,by=c("SEQN", "Begin.Year", "EndYear"))
data$years = as.factor(paste0(data$Begin.Year,"-", data$EndYear))
##fix up our list linking variable names to the table they came from
var2Table = c(var2Table, list("BPX_I"=c("BPXDI1","BPXDI2","BPXSY1","BPXSY2")))
phs_dat = phesant(data)
data = phs_dat$data
DT::datatable(phs_dat$phs_res)
In the next code chunk we convert the multilevel variables into R factors. To do this we make use of the nhanesTranslate function. While that function does most of the work, there are some variables we need to deal with manually to replicate the analysis. First, the years variable is something we added, so we need to handle it ourselves. Then we use nhanesTranslate to transform all of the internal values. Next we need to do a little more work modifying the levels of the education variable: the authors decided to group education into three levels, less than high school (<HS), high school (HS), and more than high school (>HS). We also need to address some issues around the hypertension variables. Participants were asked (BPQ020) whether they had ever been told by a doctor or health care professional that they had high blood pressure. The survey was designed to then skip a number of questions about hypertension for those who said they did not have it. However, this introduces missing values in question BPQ050A, whether or not they were currently taking a prescription medication for hypertension. Presumably only people who had been told by their doctor that they have hypertension would be taking a prescription, so we will need to manually adjust the data so that these are answered No, rather than missing.
data$DMDEDUC2 = factor(data$DMDEDUC2)
levels(data$DMDEDUC2) <- c("<HS",">HS",NA,"HS","<HS",NA,">HS")
##
## fixup the data for a skipped question
hypertensiveMeds = data$BPQ050A
hypertensiveMeds[data$BPQ020=="No"] = "No"
hypertensiveMeds[data$BPQ040A=="No"] = "No"
data$BPQ050A = hypertensiveMeds
##remove any record with at least one NA and then subset to those over 40
##data <- na.omit(data)
data <- data[data$RIDAGEYR>=40,]
dim(data)
#> [1] 7648 32
At this point we have 7648 individuals left.
Table 1 provides some basic summaries of the demographic data. We will create a temporary subset of the data that uses only the complete cases.
pcobalt = ifelse(data$LBXBCO <= 0.12, "<=0.12",
ifelse(data$LBXBCO >= 0.13 & data$LBXBCO <= 0.14, "0.13-0.14",
ifelse(data$LBXBCO >= 0.15 & data$LBXBCO <= 0.18, "0.15-0.18",
ifelse(data$LBXBCO >= 0.19, ">=0.19",
NA) )))
##make it an ordered factor
pcob = factor(pcobalt, levels=c("<=0.12","0.13-0.14", "0.15-0.18",">=0.19"), ordered=TRUE)
data$pcobalt = pcob
table(pcob, useNA="always")
#> pcob
#> <=0.12 0.13-0.14 0.15-0.18 >=0.19 <NA>
#> 1943 1433 1820 1778 674
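The nested ifelse() calls work because LBXBCO is reported to two decimal places, but the same grouping can be expressed more compactly with cut(), which guarantees the bins are exhaustive and ordered. A sketch on toy cobalt values; the breakpoints, placed midway between the reported two-decimal cutoffs, are our assumption:

```r
## Equivalent quartile grouping via cut(); breaks sit midway between
## the two-decimal cutoffs, so e.g. 0.12 and 0.13 fall in different bins.
co <- c(0.10, 0.12, 0.13, 0.14, 0.15, 0.18, 0.19, 0.25, NA)
pcob2 <- cut(co,
             breaks = c(-Inf, 0.125, 0.145, 0.185, Inf),
             labels = c("<=0.12", "0.13-0.14", "0.15-0.18", ">=0.19"),
             ordered_result = TRUE)
table(pcob2, useNA = "always")
```

Each toy value lands in the expected group (two per bin, one NA), and the result is already an ordered factor, so the separate factor() call is unnecessary.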
AgeGp = data |> group_by(pcobalt) |> summarise(mean=mean(RIDAGEYR,na.rm=TRUE),SD=sd(RIDAGEYR,na.rm=TRUE))
Here we implement the definitions from Section 3.1 of the Cobalt paper. For hypertension they described using reported systolic and diastolic blood pressure measurements as well as self-reported statements regarding whether a physician had ever told them that they have high blood pressure.
Note that it is unclear whether the authors averaged the two systolic and diastolic blood pressure reads. We average them anyway, because doing so should give more precise blood pressure estimates.
One might also look at the use of prescribed antihypertensives, as these will modulate the systolic and diastolic measures. Data on self-report come from the BPQ tables in NHANES. https://wwwn.cdc.gov/nchs/nhanes/2011-2012/BPQ_G.htm
# "Hypertension was defined as systolic blood pressure (SBP) ≥140 mm Hg, diastolic blood pressure ≥90mm Hg, or the use of antihypertensive medication. "
data$hypertension <- data$DIASTOLIC >= 90 | data$SYSTOLIC >= 140 | data$BPQ050A=="Yes"
barplot(table(data$hypertension))
data$diabetes = data$DIQ010 == "Yes" | data$LBXGLU > 110 | data$LBXGH > 6.5
barplot(table(data$diabetes))
data$HighLDL = data$LBDLDL > 130
barplot(table(data$HighLDL))
data$LowHDL = (data$RIAGENDR=="Male" & data$LBDHDD < 40) | (data$RIAGENDR=="Female" & data$LBDHDD < 50)
barplot(table(data$LowHDL))
Now let's define the elevated total cholesterol variable.
elevatedTC = data$LBXTC>200
data$elevatedTC = elevatedTC
Note that some of our groupings are very similar to those reported in the Cobalt paper, but some, notably elevated LDL, are quite different. This discrepancy should be explored (it may have to do with incorrect subsetting by age).
## 2.5 Compare with Table-2
DBP = data |> group_by(pcobalt) |> summarise(mean=mean(DIASTOLIC, na.rm=TRUE),SD=sd(DIASTOLIC,na.rm=TRUE))
DBP$stat = paste(round(DBP$mean,1),"±",round(DBP$SD,1))
DBPmn = mean(data$DIASTOLIC, na.rm=TRUE)
DBPsd = sd(data$DIASTOLIC, na.rm=TRUE)
SBP = data |> group_by(pcobalt) |> summarise(mean=mean(SYSTOLIC,na.rm=TRUE),SD=sd(SYSTOLIC, na.rm=TRUE))
SBP$stat = paste(round(SBP$mean,1),"±",round(SBP$SD,1))
SBPmn = mean(data$SYSTOLIC, na.rm=TRUE)
SBPsd = sd(data$SYSTOLIC, na.rm=TRUE)
dbp_t = t(DBP)
colnames(dbp_t) = DBP$pcobalt
sbp_t = t(SBP)
colnames(sbp_t) = SBP$pcobalt
table2 = rbind(sbp_t["stat",],dbp_t["stat",])
table2 = table2[,c("<=0.12","0.13-0.14","0.15-0.18",">=0.19")]
table2 = cbind("Blood Pressures"=c("SBP (mm Hg), mean±SD","DBP (mm Hg), mean±SD"),table2)
The numbers we obtain are not exactly the same as those in Table 2 of the paper, possibly because the authors did not use the average of the two blood pressure reads.
library(kableExtra)
kbl(table2) |>
kable_classic() |>
add_header_above(c(" " = 1, "Cobalt Quartiles (ug/L)" = 4))
Cobalt Quartiles (ug/L)

Blood Pressures | <=0.12 | 0.13-0.14 | 0.15-0.18 | >=0.19 |
---|---|---|---|---|
SBP (mm Hg), mean±SD | 129.7 ± 18.3 | 131.3 ± 19.1 | 132.6 ± 20.5 | 131.6 ± 21.4 |
DBP (mm Hg), mean±SD | 72.5 ± 13.4 | 73 ± 12.9 | 71.9 ± 13.4 | 70.2 ± 14.1 |
The authors don’t seem to explore the relationship between taking medications (e.g., insulin and oral hypoglycemic drugs) and disease (e.g., diabetes or fasting glucose levels). For hypertension, diabetes and dyslipidemia these would seem to be interesting relationships to explore.
In Section 3.2 of the Cobalt paper the authors describe their use of binary logistic regression models. They use dyslipidemia as the outcome and adjust for age, sex and BMI (their model 1). They split cobalt levels into the groupings described above and then fit logistic models that were linear in the covariates. Here we provide the tools to replicate that analysis and also explore the use of regression splines to fit the continuous variables in the model, namely age, BMI and cobalt levels.
In the following section, we run the logistic regression models as generalized linear models (GLMs). The outcome is the hypertension indicator, and the adjustment variables are age (RIDAGEYR), gender (RIAGENDR), BMI (BMXBMI), education (DMDEDUC2), and ethnicity (RIDRETH1). The first GLM uses linear terms; the second also combines terms additively but applies a natural spline to the continuous variables.
subSet = data[, c("hypertension","RIDAGEYR", "RIAGENDR", "BMXBMI","DMDEDUC2", "RIDRETH1")]
subSet = na.omit(subSet)
lm_logit <- glm(hypertension ~ RIDAGEYR + RIAGENDR + BMXBMI+DMDEDUC2+RIDRETH1, data = subSet, family = "binomial",na.action=na.omit)
##spline covariates
ns_logit <- glm(hypertension ~ ns(RIDAGEYR,df=7)+RIAGENDR + ns(BMXBMI,df=7) + DMDEDUC2 + RIDRETH1,
data = subSet, family = "binomial",na.action=na.omit)
# library(pROC)
library(plotROC)
test = tibble(hypertension=subSet$hypertension, lm=lm_logit$fitted.values, ns=ns_logit$fitted.values)
longtest <- reshape2::melt(test,id.vars="hypertension")
colnames(longtest) = c('hypertension','model','value')
ggplot(longtest, aes(d = as.numeric(hypertension), m = value, color = model))+ geom_abline()+ geom_roc(size = 1.25) + style_roc()
# plot(roc(data$hypertension,
# fitted(lm_logit)),
# print.auc = T,
# col = "red")
#
# plot(roc(data$hypertension,
# fitted(ns_logit)),
# print.auc = T,
# col = "blue",
# add = T)
# Age
df_age_fitt = list("Binned Data"=make_bins(x=subSet$RIDAGEYR,y=as.numeric(subSet$hypertension),nbin=600),
"Linear"=make_bins(x=subSet$RIDAGEYR,y=lm_logit$fitted.values,nbin=600),
"Spline"=make_bins(x=subSet$RIDAGEYR,y=ns_logit$fitted.values,nbin=600)
)
#> Warning in min(xx[xx > upper]): no non-missing arguments to min; returning Inf
#> Warning in min(xx[xx > upper]): no non-missing arguments to min; returning Inf
#> Warning in min(xx[xx > upper]): no non-missing arguments to min; returning Inf
age_fitt = plot_bins2(df_age_fitt,xlab="Age (year)",ylab="Hypertension",is_facet=F)
#BMI
df_bmi_fit =list("Linear"=make_bins(x=subSet$BMXBMI,y=lm_logit$fitted.values,nbin=600),
"Spline"=make_bins(x=subSet$BMXBMI,y=ns_logit$fitted.values,nbin=600),
"Binned Data"=make_bins(x=subSet$BMXBMI,y=as.numeric(subSet$hypertension),nbin=600)
)
bmi_fit <- plot_bins2(df_bmi_fit,xlab="BMI",ylab="Hypertension",is_facet=F)
The following plots show binned hypertension data; each bin contains about 600 data points and we compute the proportion of participants in the bin who have hypertension. The Linear and Spline curves show the fitted probabilities from the GLM with linear terms and from the GLM with natural splines on the continuous terms. For both age, panel a), and BMI, panel b), the GLM using splines agrees with the estimates obtained by binning, while modeling these terms with a simple linear term produces more substantial discrepancies.
To compute the model estimates for each bin we simply average the computed fitted values (which are defined to be back-transformed to probabilities for logistic regression) over the same individuals in each bin. One might want to examine the relationship on the logit scale, which is easily done.
ggpubr::ggarrange(age_fitt,bmi_fit,nrow = 1,ncol = 2,labels = c('a)','b)'))
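The bin-averaging step itself is simple. Below is a minimal sketch with a hypothetical helper (not the EnWAS make_bins() implementation) that averages fitted probabilities within bins of a covariate; qlogis() then gives the same summary on the logit scale:

```r
## Hypothetical helper: average fitted probabilities within bins of a
## covariate. Not the EnWAS::make_bins() implementation.
bin_means <- function(x, p, nbins = 4) {
  bins <- cut(x, breaks = nbins)   # equal-width bins over the range of x
  tapply(p, bins, mean)            # mean fitted probability per bin
}
set.seed(1)
x <- runif(100, 40, 85)            # e.g. ages
p <- plogis(-8 + 0.12 * x)         # fitted probabilities from some model
prob_scale  <- bin_means(x, p)
logit_scale <- qlogis(prob_scale)  # same summary on the logit scale
```

Because fitted values from a logistic GLM are already back-transformed to probabilities, averaging them within a bin is directly comparable to the observed proportion in that bin.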
As the authors pointed out, blood cobalt concentrations are not associated with the risk of hypertension, as the following summary table shows. FIXME: but they have a bunch of other features in their Table 2 - and it would be good if we could start to look at them.
subSet2 = data[, c("hypertension","RIDAGEYR", "RIAGENDR", "BMXBMI","DMDEDUC2", "RIDRETH1", "LBXBCO")]
subSet2 = na.omit(subSet2)
lm_logit <- glm(hypertension ~ RIDAGEYR + RIAGENDR + BMXBMI+DMDEDUC2+RIDRETH1+LBXBCO, data = subSet2, family = "binomial")
ns_logit <- glm(hypertension ~ ns(RIDAGEYR,df=7)+RIAGENDR + ns(BMXBMI,df=7) + DMDEDUC2 + RIDRETH1+LBXBCO,
data = subSet2, family = "binomial",na.action=na.omit)
sjPlot::tab_model(lm_logit,ns_logit,
dv.labels = c("lm", "spline"),
show.ci = FALSE,show.stat = TRUE,show.se = TRUE,p.style = "scientific", digits.p = 2)
Predictors | Odds Ratios (lm) | std. Error | Statistic | p | Odds Ratios (spline) | std. Error | Statistic | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 0.00 | 0.00 | -27.56 | 3.18e-167 | 0.09 | 0.04 | -5.08 | 3.78e-07 |
RIDAGEYR | 1.08 | 0.00 | 28.96 | 1.90e-184 | | | | |
RIAGENDR [Male] | 1.03 | 0.06 | 0.58 | 5.59e-01 | 1.01 | 0.06 | 0.26 | 7.95e-01 |
BMXBMI | 1.08 | 0.00 | 16.55 | 1.59e-61 | | | | |
DMDEDUC2 [>HS] | 0.79 | 0.06 | -3.15 | 1.61e-03 | 0.78 | 0.06 | -3.29 | 1.00e-03 |
DMDEDUC2 [HS] | 1.02 | 0.09 | 0.26 | 7.97e-01 | 1.00 | 0.09 | 0.05 | 9.58e-01 |
RIDRETH1 [Non-Hispanic Black] | 2.77 | 0.27 | 10.40 | 2.45e-25 | 2.86 | 0.28 | 10.57 | 4.25e-26 |
RIDRETH1 [Non-Hispanic White] | 1.08 | 0.10 | 0.83 | 4.04e-01 | 1.14 | 0.11 | 1.42 | 1.57e-01 |
RIDRETH1 [Other Hispanic] | 1.43 | 0.15 | 3.36 | 7.81e-04 | 1.42 | 0.15 | 3.24 | 1.21e-03 |
RIDRETH1 [Other Race - Including Multi-Racial] | 1.69 | 0.18 | 5.02 | 5.28e-07 | 1.77 | 0.19 | 5.38 | 7.51e-08 |
LBXBCO | 1.02 | 0.06 | 0.42 | 6.73e-01 | 1.03 | 0.06 | 0.52 | 6.06e-01 |
RIDAGEYR [1st degree] | | | | | 3.09 | 0.57 | 6.06 | 1.37e-09 |
RIDAGEYR [2nd degree] | | | | | 4.87 | 1.08 | 7.11 | 1.13e-12 |
RIDAGEYR [3rd degree] | | | | | 7.02 | 1.37 | 10.02 | 1.21e-23 |
RIDAGEYR [4th degree] | | | | | 11.37 | 2.78 | 9.95 | 2.62e-23 |
RIDAGEYR [5th degree] | | | | | 10.21 | 2.22 | 10.67 | 1.37e-26 |
RIDAGEYR [6th degree] | | | | | 23.64 | 8.39 | 8.92 | 4.87e-19 |
RIDAGEYR [7th degree] | | | | | 17.19 | 2.31 | 21.20 | 9.08e-100 |
BMXBMI [1st degree] | | | | | 1.95 | 0.81 | 1.60 | 1.09e-01 |
BMXBMI [2nd degree] | | | | | 1.88 | 0.93 | 1.28 | 2.00e-01 |
BMXBMI [3rd degree] | | | | | 2.34 | 1.07 | 1.86 | 6.27e-02 |
BMXBMI [4th degree] | | | | | 2.83 | 1.32 | 2.23 | 2.55e-02 |
BMXBMI [5th degree] | | | | | 14.85 | 6.47 | 6.20 | 5.74e-10 |
BMXBMI [6th degree] | | | | | 12.11 | 14.56 | 2.07 | 3.81e-02 |
BMXBMI [7th degree] | | | | | 21.08 | 32.35 | 1.99 | 4.70e-02 |
Observations | 6576 | | | | 6576 | | | |
R2 Tjur | 0.194 | | | | 0.198 | | | |
FIXME: This likely belongs in the QA/QC section. The point of this code is to show the reader how they can estimate the functional form of the spline that they are fitting to the data. To do that, we pick a covariate, say Age, where we want to compute the spline. Then we pick a set of Age values that cover the range of ages in the model. To get predictions from the model for a specific age we also need to specify values for all the other covariates in the model. Our suggestion is that for categorical variables you choose the most common category and for continuous variables use the median value.
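That suggestion can be sketched as a small helper (the function name is ours) that builds a one-row reference data frame, taking the modal level for factor columns and the median for numeric columns:

```r
## Build a one-row "typical participant" data frame:
## modal level for factors/characters, median for numeric columns.
reference_row <- function(df) {
  as.data.frame(lapply(df, function(col) {
    if (is.numeric(col)) {
      median(col, na.rm = TRUE)           # typical value for continuous vars
    } else {
      names(which.max(table(col)))        # most common category otherwise
    }
  }), stringsAsFactors = FALSE)
}
toy <- data.frame(
  RIDAGEYR = c(45, 60, 62, 71),
  RIAGENDR = factor(c("Male", "Female", "Male", "Male")),
  BMXBMI   = c(24.0, 28.9, 31.2, 27.5)
)
reference_row(toy)
#>   RIDAGEYR RIAGENDR BMXBMI
#> 1       61     Male   28.2
```

Replicating that single row across a grid of ages (as done with dfimpute below) then isolates the fitted spline for the covariate of interest.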
*** Robert, you only keep one base model in the end, do you want me to plot or compare something? *** FIXME: yes, the point here is to develop a better summary of the comparison of models with spline terms. I really don’t like the R output that shows each term individually, as you can’t really interpret the terms and they take up a lot of room. I think we should instead create one line for each spline term, and in it only put the value of the LRT comparing the model with the spline to the model without it. From that comparison you can get the chi-squared statistic, the p-value and the df, and those could be put into the table. I think that would be a better summary.
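The proposed one-line-per-spline-term summary could be built from a likelihood ratio test via anova(); the sketch below uses simulated data rather than the NHANES fits:

```r
## LRT comparing a model with a spline term to one with the linear term;
## simulated data, not the NHANES fits from this vignette.
library(splines)
set.seed(2)
n   <- 500
age <- runif(n, 40, 85)
y   <- rbinom(n, 1, plogis(-6 + 0.08 * age))
full    <- glm(y ~ ns(age, df = 7), family = binomial)
reduced <- glm(y ~ age, family = binomial)
lrt <- anova(reduced, full, test = "Chisq")
## one table row per spline term: chi-squared statistic, df, p-value
c(chisq = lrt$Deviance[2], df = lrt$Df[2], p = lrt$`Pr(>Chi)`[2])
```

Repeating this per spline term (age, BMI, cobalt) would give a compact table replacing the per-basis-function rows that sjPlot prints.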
# base_logit <- glm(hypertension ~ RIDAGEYR + RIAGENDR + BMXBMI+DMDEDUC2+RIDRETH1 + sqrt(LBXBCO), data = data, family = "binomial")
# base_logit <- glm(hypertension ~ ns(RIDAGEYR,df=7) + RIAGENDR + ns(BMXBMI,df=7) + DMDEDUC2 + RIDRETH1 + ns(sqrt(LBXBCO), df=7) + years, data = data, family = "binomial")
base_logit <- glm(hypertension ~ ns(RIDAGEYR,df=7) + RIAGENDR + ns(BMXBMI,df=7) + DMDEDUC2 + RIDRETH1, data = data, family = "binomial")
## try to do some prediction - and then get a plot of the age spline
##46 of these
yvals = seq(40,85,by=1)
dfimpute = data.frame(RIDAGEYR=yvals, RIAGENDR=rep("Male", 46), BMXBMI=rep(28.9, 46), DMDEDUC2=rep("HS", 46), RIDRETH1=rep("Non-Hispanic White", 46))
predV = predict(base_logit, newdata=dfimpute)
# lines(40:85, predV)
qplot(40:85,predV,geom = "line") + theme_bw()
##now look at BMI
yBMI = 14:80
dfBMIimpute = data.frame(RIDAGEYR=rep(60,67) , RIAGENDR=rep("Male", 67), BMXBMI=yBMI, DMDEDUC2=rep("HS", 67), RIDRETH1=rep("Non-Hispanic White", 67))
predBMI = predict(base_logit, newdata=dfBMIimpute)
qplot(14:80,predBMI,geom = "line") + theme_bw()