= ""
nhanesDataPath load("d4.rda")
load("metaD.rda")
::datatable(metaD) DT
Week 7: Advanced Regression
This week we’ll take a deeper look at regression models using a different NHANES dataset.
Load the data
There are 6063 observations; some are incomplete, with missing values for some covariates. There are 22 covariates, which have cryptic names, and you need to use the metadata to resolve them. The survey is very complex, and typically any analysis requires a substantial amount of reading of the documentation. Here we will guide you past some of the hurdles.
We load up the data and the metadata. In the metadata we have a textual description of the phenotype, the short name, and the target. The target tells us which of the sampled individuals were eligible to answer the question.
We will look at the relationship between the variable LBXTC (which is Total Cholesterol in mg/dL measured by a blood draw) and the age of the participant in years.
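A basic scatterplot shows this relationship; here is a minimal sketch (the axis labels and point styling are our own choices):

plot(d4$RIDAGEYR, d4$LBXTC, pch = 19, cex = 0.3,
     xlab = "Age (years)", ylab = "Total cholesterol (mg/dL)")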
We can see that in this plot over-plotting is a substantial issue. You might also notice what seems like a lot of data at age 80; this is because any age over 80 was truncated to 80 to prevent reidentification of survey participants. In a complete analysis this should probably be adjusted for in some way, but we will ignore it for now.
We can try some other methods, such as hexbin plotting and smoothScatter, to get a better idea of the distribution of the data.
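A minimal sketch of both approaches, assuming the hexbin package is installed (we drop missing values first to be safe):

ok = !is.na(d4$RIDAGEYR) & !is.na(d4$LBXTC)

# density-shaded scatterplot from base graphics
smoothScatter(d4$RIDAGEYR[ok], d4$LBXTC[ok],
              xlab = "Age (years)", ylab = "Total cholesterol (mg/dL)")

# hexagonal binning via the hexbin package
library("hexbin")
hb = hexbin(d4$RIDAGEYR[ok], d4$LBXTC[ok], xbins = 40)
plot(hb, xlab = "Age (years)", ylab = "Total cholesterol (mg/dL)")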
Now we can see a few outliers with extremely high serum cholesterol. We also get a sense that the trend is not exactly a straight line, but rather a parabola: lower for the young and the old, and a bit higher in the middle.
We fit a linear model first.
lm1 = lm(d4$LBXTC ~ d4$RIDAGEYR)
summary(lm1)
Call:
lm(formula = d4$LBXTC ~ d4$RIDAGEYR)
Residuals:
Min 1Q Median 3Q Max
-114.68 -27.95 -2.91 23.34 357.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 170.0140 1.4340 118.56 <2e-16 ***
d4$RIDAGEYR 0.3708 0.0285 13.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.24 on 5689 degrees of freedom
(372 observations deleted due to missingness)
Multiple R-squared: 0.02891, Adjusted R-squared: 0.02874
F-statistic: 169.4 on 1 and 5689 DF, p-value: < 2.2e-16
plot(lm1$fitted.values, lm1$residuals)
## fit a loess curve to the residuals
l2 = loess(lm1$residuals ~ lm1$fitted.values)
pl = predict(l2, newdata=sort(lm1$fitted.values))
lines(x=sort(lm1$fitted.values), y=pl, col="blue", lwd=2)
abline(h=0, col="red")
Notice that both terms in the model are very significant, but that the multiple \(R^2\) is only around 3%.
So age, in years, is not explaining very much of the variation. But because we have such a large data set, the parameter estimates are still found to be significantly different from zero.
Spline Models
When a linear model does not appear sufficient, we can try other models. One choice is to use natural splines, which are very flexible. They create a “knotted” curve: at each knot the shape of the curve can change. They are based on B-splines, with the constraint that the fit is linear outside the range of the data.
However, we have to decide the number of knots we want to use in the model. Based on the initial analysis, we chose to use df=7, which places six internal knots when fitting the splines. We have almost 6,000 residual degrees of freedom here, so using up a few to get a more appropriate fit seems reasonable.
library("splines")
lm2 = lm(d4$LBXTC ~ ns(d4$RIDAGEYR, df=7))
summary(lm2)
Call:
lm(formula = d4$LBXTC ~ ns(d4$RIDAGEYR, df = 7))
Residuals:
Min 1Q Median 3Q Max
-113.43 -26.32 -2.88 22.47 343.31
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.799 2.178 71.071 < 2e-16 ***
ns(d4$RIDAGEYR, df = 7)1 39.956 3.379 11.826 < 2e-16 ***
ns(d4$RIDAGEYR, df = 7)2 32.705 4.074 8.028 1.20e-15 ***
ns(d4$RIDAGEYR, df = 7)3 55.583 3.637 15.283 < 2e-16 ***
ns(d4$RIDAGEYR, df = 7)4 42.275 3.725 11.347 < 2e-16 ***
ns(d4$RIDAGEYR, df = 7)5 30.111 3.352 8.984 < 2e-16 ***
ns(d4$RIDAGEYR, df = 7)6 41.098 5.758 7.137 1.07e-12 ***
ns(d4$RIDAGEYR, df = 7)7 15.992 2.478 6.453 1.19e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.62 on 5683 degrees of freedom
(372 observations deleted due to missingness)
Multiple R-squared: 0.1044, Adjusted R-squared: 0.1033
F-statistic: 94.65 on 7 and 5683 DF, p-value: < 2.2e-16
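If you are curious where ns() placed the internal knots, you can inspect the basis object directly. (The locations here may differ slightly from those used inside lm2, since lm also drops rows with missing cholesterol.)

# age quantiles chosen as internal knots for df = 7
attr(ns(na.omit(d4$RIDAGEYR), df = 7), "knots")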
We can use the anova function to compare the two models.
anova(lm1, lm2)
Analysis of Variance Table
Model 1: d4$LBXTC ~ d4$RIDAGEYR
Model 2: d4$LBXTC ~ ns(d4$RIDAGEYR, df = 7)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 5689 9674185
2 5683 8922039 6 752146 79.848 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice also that the multiple \(R^2\) went up to about 10%, a pretty substantial increase, suggesting that the curvilinear nature of the relationship is substantial.
The residual standard error also decreased by about 5%.
We have lost the simple explanation that comes from fitting a linear model: we can no longer say that serum cholesterol increases by \(a\) units per year. But that model was wrong, so it really shouldn't be used.
We can use the regression model we fit to make predictions for anyone, and these are substantially more accurate.
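As a sketch of how such predictions work, we can refit the spline model with a data argument so that predict() accepts new data (lm2b and ageGrid are our own names):

lm2b = lm(LBXTC ~ ns(RIDAGEYR, df = 7), data = d4)
ageGrid = data.frame(RIDAGEYR = c(20, 50, 80))
predict(lm2b, newdata = ageGrid)   # predicted total cholesterol at ages 20, 50, and 80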
Spline Models
Even though the regression summary prints a separate row for each spline term, the terms are not independent variables: you need to either retain all of them or drop all of them.
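One convenient way to respect this grouping is drop1, which tests removing the whole ns() term (all seven columns) at once:

drop1(lm2, test = "F")   # the spline basis is treated as a single term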
Sex
Next we might want to start adding other variables and exploring the different relationships. Let's consider sex. For now we will leave out age and just try to understand what happens with sex. First, let's fit the model without an intercept.
lm3 = lm(LBXTC ~ RIAGENDR-1, data=d4)
summary(lm3)
Call:
lm(formula = LBXTC ~ RIAGENDR - 1, data = d4)
Residuals:
Min 1Q Median 3Q Max
-107.55 -29.55 -3.55 24.21 360.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
RIAGENDRMale 184.5534 0.7974 231.4 <2e-16 ***
RIAGENDRFemale 189.7914 0.7692 246.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.76 on 5689 degrees of freedom
(372 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9526
F-statistic: 5.722e+04 on 2 and 5689 DF, p-value: < 2.2e-16
Here we can see that the mean for females is a bit higher than for males. Both are significant, and notice how large the multiple \(R^2\) value is. This is not a very interesting test: we are asking whether each mean is zero, which isn't even physically possible. \[ y_i = \beta_M \cdot 1_{M,i} + \beta_F \cdot 1_{F,i} \] where \(1_{M,i}\) is 1 if the \(i^{th}\) case is male and zero otherwise, and similarly for \(1_{F,i}\). Instead, let's use a model with an intercept to ask whether the mean for males is different from that for females: \[ y_i = \beta_0 + \beta_1 \cdot 1_{F,i} \]
This way, \(E[Y|M] = \beta_0\) and \(E[Y|F] = \beta_0 + \beta_1\), so \(\beta_1\) estimates the difference in mean between females and males.
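To see the two parameterizations concretely, we can inspect the design matrices that R builds for each formula (a quick sketch; only the first rows are shown):

# without an intercept: one indicator column per sex
head(model.matrix(~ RIAGENDR - 1, data = d4))
# with an intercept: a baseline column plus a Female indicator
head(model.matrix(~ RIAGENDR, data = d4))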
lm3 = lm(LBXTC ~ RIAGENDR, data=d4)
summary(lm3)
Call:
lm(formula = LBXTC ~ RIAGENDR, data = d4)
Residuals:
Min 1Q Median 3Q Max
-107.55 -29.55 -3.55 24.21 360.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 184.5534 0.7974 231.434 < 2e-16 ***
RIAGENDRFemale 5.2380 1.1080 4.728 2.33e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.76 on 5689 degrees of freedom
(372 observations deleted due to missingness)
Multiple R-squared: 0.003913, Adjusted R-squared: 0.003738
F-statistic: 22.35 on 1 and 5689 DF, p-value: 2.327e-06
Now we see an Intercept term, which is the mean for males, the reference level. The estimate for RIAGENDRFemale represents the difference in means between females and males; if it is zero then there is no difference in total cholesterol between men and women.
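We can sanity-check these coefficients against the raw group means:

# group means of total cholesterol by sex; their difference is the Female coefficient
tapply(d4$LBXTC, d4$RIAGENDR, mean, na.rm = TRUE)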
Looking at more variables
Now we will put together a set of features (variables) that we are interested in. For simplicity we keep only participants for whom we have all the data. Here we use apply, which applies a function to each row of the dataset. We could also have used the tidyverse functions for handling missing data, which we've seen previously.
= c("RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "INDFMPIR", "LBDHDD", "LBXGH", "BMXBMI", "LBXTC")
ivars
= d4[,ivars]
d4sub = apply(d4sub, 1, function(x) sum(is.na(x)))
compCases = compCases==0
cC = d4sub[cC,]
d4sub dim(d4sub)
[1] 4592 9
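As an aside, the same subset can be built in one step with the base R function complete.cases:

# keep only rows with no missing values in the selected columns
d4sub = d4[complete.cases(d4[, ivars]), ivars]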
One quick transformation
The variable DMDEDUC2 is a bit too granular for our purposes. Let's recode it into three levels: less than high school (<HS), high school (HS), and more than high school (>HS).
table(d4sub$DMDEDUC2)
Less than 9th grade
523
9-11th grade (Includes 12th grade with no diploma)
506
High school graduate/GED or equivalent
1007
Some college or AA degree
1391
College graduate or above
1164
Don't Know
1
dd = d4sub$DMDEDUC2
dd[dd == "Don't Know"] = NA
eduS = ifelse(dd == "Less than 9th grade" | dd == "9-11th grade (Includes 12th grade with no diploma)", "<HS",
              ifelse(dd == "High school graduate/GED or equivalent", "HS", ">HS"))
# stick this into our dataframe
# and drop the NA
d4sub$eduS = eduS
d4sub = d4sub[-which(is.na(eduS)), ]
table(eduS, dd, useNA = "always")
dd
eduS Less than 9th grade 9-11th grade (Includes 12th grade with no diploma)
<HS 523 506
>HS 0 0
HS 0 0
<NA> 0 0
dd
eduS High school graduate/GED or equivalent Some college or AA degree
<HS 0 0
>HS 0 1391
HS 1007 0
<NA> 0 0
dd
eduS College graduate or above Don't Know <NA>
<HS 0 0 0
>HS 1164 0 0
HS 0 0 0
<NA> 0 0 1
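If you prefer the tidyverse, the same recoding can be written with dplyr::case_when; this is a sketch assuming dplyr is installed:

library("dplyr")
eduS = case_when(
  dd %in% c("Less than 9th grade",
            "9-11th grade (Includes 12th grade with no diploma)") ~ "<HS",
  dd == "High school graduate/GED or equivalent" ~ "HS",
  is.na(dd) ~ NA_character_,   # keep the Don't Know case as NA
  TRUE ~ ">HS"
)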
Back to Regression
Let’s run a regression including all of our variables on cholesterol. We can do this by just indicating .
in the formula.
lmF = lm(LBXTC ~ ., data=d4sub)
summary(lmF)
Call:
lm(formula = LBXTC ~ ., data = d4sub)
Residuals:
Min 1Q Median 3Q Max
-123.19 -27.68 -3.40 22.41 362.93
Coefficients: (2 not defined because of singularities)
Estimate Std. Error
(Intercept) 139.06987 5.34623
RIDAGEYR 0.11231 0.03739
RIAGENDRFemale 0.99071 1.29161
RIDRETH1Other Hispanic -2.41290 2.23553
RIDRETH1Non-Hispanic White -1.86248 1.94109
RIDRETH1Non-Hispanic Black -8.60649 2.07247
RIDRETH1Other Race - Including Multi-Racial 1.79765 2.26131
DMDEDUC29-11th grade (Includes 12th grade with no diploma) 1.95438 2.59288
DMDEDUC2High school graduate/GED or equivalent 2.09692 2.32957
DMDEDUC2Some college or AA degree 2.23659 2.30162
DMDEDUC2College graduate or above -1.09357 2.50664
INDFMPIR 1.13163 0.43699
LBDHDD 0.42540 0.03907
LBXGH 2.52364 0.55377
BMXBMI 0.21048 0.09372
eduS>HS NA NA
eduSHS NA NA
t value Pr(>|t|)
(Intercept) 26.013 < 2e-16 ***
RIDAGEYR 3.003 0.00268 **
RIAGENDRFemale 0.767 0.44310
RIDRETH1Other Hispanic -1.079 0.28049
RIDRETH1Non-Hispanic White -0.960 0.33736
RIDRETH1Non-Hispanic Black -4.153 3.34e-05 ***
RIDRETH1Other Race - Including Multi-Racial 0.795 0.42668
DMDEDUC29-11th grade (Includes 12th grade with no diploma) 0.754 0.45104
DMDEDUC2High school graduate/GED or equivalent 0.900 0.36810
DMDEDUC2Some college or AA degree 0.972 0.33123
DMDEDUC2College graduate or above -0.436 0.66266
INDFMPIR 2.590 0.00964 **
LBDHDD 10.890 < 2e-16 ***
LBXGH 4.557 5.32e-06 ***
BMXBMI 2.246 0.02477 *
eduS>HS NA NA
eduSHS NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.72 on 4576 degrees of freedom
Multiple R-squared: 0.04276, Adjusted R-squared: 0.03983
F-statistic: 14.6 on 14 and 4576 DF, p-value: < 2.2e-16
Notice first that the coefficients for eduS are NA: eduS is a function of DMDEDUC2, so once the DMDEDUC2 indicators are in the model the eduS indicators are redundant, and R reports them as singularities. We see that being Non-Hispanic Black seems to have a pretty big effect, so we might want to include just that, and group all other ethnicities together. Education level seems to have little to add; let's drop it to simplify the analysis. It might also be good to get a sense of the relationship with the poverty level variable INDFMPIR.
= ifelse(d4sub$RIDRETH1 == "Non-Hispanic Black", "B", "nonB")
Black = c("RIDAGEYR", "INDFMPIR", "LBDHDD", "LBXGH", "BMXBMI", "LBXTC")
ivars
= cbind(d4sub[,ivars], Black)
d5sub = lm(LBXTC ~ . , data=d5sub)
lmFx summary(lmFx)
Call:
lm(formula = LBXTC ~ ., data = d5sub)
Residuals:
Min 1Q Median 3Q Max
-123.65 -27.94 -3.45 22.47 360.06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.78645 5.19610 25.555 < 2e-16 ***
RIDAGEYR 0.09909 0.03618 2.739 0.00619 **
INDFMPIR 0.86401 0.38147 2.265 0.02356 *
LBDHDD 0.43038 0.03662 11.753 < 2e-16 ***
LBXGH 2.60260 0.54925 4.738 2.22e-06 ***
BMXBMI 0.21216 0.09002 2.357 0.01848 *
BlacknonB 7.49101 1.52918 4.899 9.98e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.73 on 4584 degrees of freedom
Multiple R-squared: 0.0408, Adjusted R-squared: 0.03955
F-statistic: 32.5 on 6 and 4584 DF, p-value: < 2.2e-16
We can compare the two models using the anova function.
anova(lmFx, lmF)
Exercise: Plot RIDAGEYR versus INDFMPIR. What do you see in the plot? Does anything cause you concern about fitting a linear model?
Missing Indicator Approach
The missing indicator approach may be useful for data where there is a limit of detection, so that values below some lower bound \(\alpha_L\) or above some upper bound \(\alpha_U\) are set to \(\alpha_L\) or \(\alpha_U\), respectively. A similar approach can be used for the Winsorization of the RIDAGEYR variable, where values over 80 are set to 80. We are not proposing this for other types of missingness; although there is a rich literature, and users may want to explore the broader use of this method, it is important to realize that bias or increased variance may result, often depending on untestable assumptions.
The reason that we believe this approach is appropriate for the cases listed is that we actually did measure the variable, and many other variables on those individuals; we just cannot report the exact value for any individual. Hence one can interpret the indicator as being some form of average over all individuals affected by the reporting limits. It is worth noting that these limitations have implications for prediction methods: while in principle one could estimate some outcome for an 88-year-old person, the data won't support that prediction. Instead one should ignore the linear predictor for age and use the indicator variable. Chiou et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6812630/) provide some rationale for logistic regression, and the references therein point to results for linear regression. Groenwold et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414599/) provide advice on more general use of the method to address missing data.
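Concretely, the model we will fit has the form \[ y_i = \beta_0 + \beta_1 \tilde{x}_i + \gamma \cdot 1_{80,i} + \epsilon_i \] where \(\tilde{x}_i\) is the reported (possibly Winsorized) age and \(1_{80,i}\) is 1 when the reported age is 80. The coefficient \(\gamma\) absorbs the average deviation of the truly-80-and-over group from what the linear age term alone would predict. (This is a sketch of the age term only; the fitted models below include the other covariates as well.)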
Next let's try to fix our model to deal with the repeated values in these two variables. An important consideration is to assess just how to interpret them. For RIDAGEYR, the documentation states that anyone over age 80 in the database has their age represented as 80. This is not censoring: the measurements (e.g., BMI, cholesterol, etc.) were all made on a person of some age larger than 80. We just Winsorized their ages, and so these individuals really are not the same as the others, for whom we have accurate age values.
So basically what we want to do is to create a dummy variable for those whose age is reported as 80.
Age80 = d5sub$RIDAGEYR == 80
d6sub = cbind(d5sub, Age80)

lmFx2 = lm(LBXTC ~ . + Age80, data=d6sub)
summary(lmFx2)
Call:
lm(formula = LBXTC ~ . + Age80, data = d6sub)
Residuals:
Min 1Q Median 3Q Max
-125.97 -28.09 -3.05 22.34 359.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.95895 5.16754 25.343 < 2e-16 ***
RIDAGEYR 0.23581 0.03996 5.901 3.88e-09 ***
INDFMPIR 0.68993 0.37964 1.817 0.0692 .
LBDHDD 0.42924 0.03638 11.799 < 2e-16 ***
LBXGH 2.29476 0.54709 4.194 2.79e-05 ***
BMXBMI 0.15983 0.08969 1.782 0.0748 .
BlacknonB 7.78276 1.51968 5.121 3.16e-07 ***
Age80TRUE -22.72915 2.90328 -7.829 6.08e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.46 on 4583 degrees of freedom
Multiple R-squared: 0.05346, Adjusted R-squared: 0.05201
F-statistic: 36.98 on 7 and 4583 DF, p-value: < 2.2e-16
Exercise: What changes do you notice in the model fit? Use the anova function to compare lmFx3 to lmFx2. How do you interpret the output? Hint: look at INDFMPIR.
Exercise: Try to fit missing indicator variables for both of the repeated values in the INDFMPIR variable. Then interpret the output.
Pov5 = d6sub$INDFMPIR == 5
Pov0 = d6sub$INDFMPIR == 0
d7sub = cbind(d6sub, Pov5, Pov0)

lmFx2 = lm(LBXTC ~ . + Age80 + Pov0 + Pov5, data=d7sub)
summary(lmFx2)
Call:
lm(formula = LBXTC ~ . + Age80 + Pov0 + Pov5, data = d7sub)
Residuals:
Min 1Q Median 3Q Max
-126.04 -28.11 -3.09 22.39 358.65
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 131.44125 5.21065 25.226 < 2e-16 ***
RIDAGEYR 0.23419 0.04003 5.850 5.25e-09 ***
INDFMPIR 0.44100 0.54075 0.816 0.4148
LBDHDD 0.42898 0.03639 11.788 < 2e-16 ***
LBXGH 2.30525 0.54743 4.211 2.59e-05 ***
BMXBMI 0.15911 0.08977 1.773 0.0764 .
BlacknonB 7.76271 1.52024 5.106 3.42e-07 ***
Age80TRUE -22.67132 2.90734 -7.798 7.74e-15 ***
Pov5TRUE 1.35245 2.30269 0.587 0.5570
Pov0TRUE -3.09279 6.30631 -0.490 0.6239
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.47 on 4581 degrees of freedom
Multiple R-squared: 0.05357, Adjusted R-squared: 0.05171
F-statistic: 28.81 on 9 and 4581 DF, p-value: < 2.2e-16
It seems that some of the apparent effect of INDFMPIR is related to the fact that we are not fitting RIDAGEYR properly.
lmFx3 = lm(LBXTC ~ ns(RIDAGEYR, df=7) + INDFMPIR + LBDHDD + LBXGH + BMXBMI + Black + Age80 + Pov0 + Pov5, data=d7sub)
summary(lmFx3)
Call:
lm(formula = LBXTC ~ ns(RIDAGEYR, df = 7) + INDFMPIR + LBDHDD +
LBXGH + BMXBMI + Black + Age80 + Pov0 + Pov5, data = d7sub)
Residuals:
Min 1Q Median 3Q Max
-109.14 -27.01 -3.55 22.15 350.63
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.26835 5.74709 21.449 < 2e-16 ***
ns(RIDAGEYR, df = 7)1 29.34696 3.89739 7.530 6.08e-14 ***
ns(RIDAGEYR, df = 7)2 27.30104 5.01414 5.445 5.46e-08 ***
ns(RIDAGEYR, df = 7)3 44.63075 4.49373 9.932 < 2e-16 ***
ns(RIDAGEYR, df = 7)4 27.23653 4.71406 5.778 8.07e-09 ***
ns(RIDAGEYR, df = 7)5 14.25183 4.46425 3.192 0.001420 **
ns(RIDAGEYR, df = 7)6 30.51098 8.12103 3.757 0.000174 ***
ns(RIDAGEYR, df = 7)7 7.84477 5.21942 1.503 0.132909
INDFMPIR -0.07063 0.53106 -0.133 0.894207
LBDHDD 0.44469 0.03569 12.460 < 2e-16 ***
LBXGH 1.87930 0.53859 3.489 0.000489 ***
BMXBMI 0.06358 0.08824 0.721 0.471190
BlacknonB 8.23587 1.49119 5.523 3.52e-08 ***
Age80TRUE -7.16363 5.57718 -1.284 0.199048
Pov0TRUE -6.82759 6.18622 -1.104 0.269792
Pov5TRUE 0.45875 2.25604 0.203 0.838876
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.62 on 4575 degrees of freedom
Multiple R-squared: 0.09415, Adjusted R-squared: 0.09118
F-statistic: 31.7 on 15 and 4575 DF, p-value: < 2.2e-16
Exercise
In the code below, we drop the terms that were not statistically significant in the model and then compare this smaller model to the larger one, above. Try interpreting the results.
lmFx4 = lm(LBXTC ~ ns(RIDAGEYR, df=7) + LBDHDD + LBXGH + Black, data=d7sub)
summary(lmFx4)
Call:
lm(formula = LBXTC ~ ns(RIDAGEYR, df = 7) + LBDHDD + LBXGH +
Black, data = d7sub)
Residuals:
Min 1Q Median 3Q Max
-107.41 -27.02 -3.52 22.08 350.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 125.09513 5.00892 24.974 < 2e-16 ***
ns(RIDAGEYR, df = 7)1 29.75058 3.88181 7.664 2.18e-14 ***
ns(RIDAGEYR, df = 7)2 27.26624 4.98275 5.472 4.68e-08 ***
ns(RIDAGEYR, df = 7)3 45.51138 4.42996 10.274 < 2e-16 ***
ns(RIDAGEYR, df = 7)4 26.21025 4.58323 5.719 1.14e-08 ***
ns(RIDAGEYR, df = 7)5 17.38565 3.87109 4.491 7.26e-06 ***
ns(RIDAGEYR, df = 7)6 28.13377 7.76676 3.622 0.000295 ***
ns(RIDAGEYR, df = 7)7 2.17338 2.73486 0.795 0.426833
LBDHDD 0.43702 0.03422 12.772 < 2e-16 ***
LBXGH 1.90936 0.53091 3.596 0.000326 ***
BlacknonB 8.09991 1.47684 5.485 4.37e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.61 on 4580 degrees of freedom
Multiple R-squared: 0.09345, Adjusted R-squared: 0.09147
F-statistic: 47.21 on 10 and 4580 DF, p-value: < 2.2e-16
anova(lmFx4, lmFx3)
Analysis of Variance Table
Model 1: LBXTC ~ ns(RIDAGEYR, df = 7) + LBDHDD + LBXGH + Black
Model 2: LBXTC ~ ns(RIDAGEYR, df = 7) + INDFMPIR + LBDHDD + LBXGH + BMXBMI +
Black + Age80 + Pov0 + Pov5
Res.Df RSS Df Sum of Sq F Pr(>F)
1 4580 7186815
2 4575 7181301 5 5513.8 0.7025 0.6215