Week 5: Exploratory analysis and basic tests

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)

Data

We can load our processed NHANES data from last week.

nhanes <- read_csv("../session4/nhanes_processed.csv")
Rows: 3346 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): RIAGENDR, RIDRETH1, DMDBORN, age.cat, plasma.glucose.cat, hba1c.ca...
dbl (14): SEQN, RIDAGEYR, INDFMPIR, SDMVPSU, SDMVSTRA, WTINT2YR, WTMEC2YR, L...
lgl  (4): OHXDECAY, OHXREST, dental.caries, in.study

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Making tables

For some data, we might want to visualize things with a table instead of a plot. We can use the table builtin function to easily tabulate age by dental caries.

table(nhanes[,c("RIDAGEYR","dental.caries")])
        dental.caries
RIDAGEYR FALSE TRUE
      13   247  290
      14   262  314
      15   222  336
      16   230  360
      17   204  354
      18   171  356

However, we can also take advantage of packages written for this purpose. For instance, we can quickly make a frequency table using the proc_freq function in flextable.

# install.packages("flextable")
library(flextable)

Attaching package: 'flextable'
The following object is masked from 'package:purrr':

    compose
proc_freq(nhanes,"RIDAGEYR","dental.caries")

RIDAGEYR

dental.caries

FALSE

TRUE

Total

13

Count

247 (7.4%)

290 (8.7%)

537 (16.0%)

Mar. pct (1)

18.5% ; 46.0%

14.4% ; 54.0%

14

Count

262 (7.8%)

314 (9.4%)

576 (17.2%)

Mar. pct

19.6% ; 45.5%

15.6% ; 54.5%

15

Count

222 (6.6%)

336 (10.0%)

558 (16.7%)

Mar. pct

16.6% ; 39.8%

16.7% ; 60.2%

16

Count

230 (6.9%)

360 (10.8%)

590 (17.6%)

Mar. pct

17.2% ; 39.0%

17.9% ; 61.0%

17

Count

204 (6.1%)

354 (10.6%)

558 (16.7%)

Mar. pct

15.3% ; 36.6%

17.6% ; 63.4%

18

Count

171 (5.1%)

356 (10.6%)

527 (15.8%)

Mar. pct

12.8% ; 32.4%

17.7% ; 67.6%

Total

Count

1,336 (39.9%)

2,010 (60.1%)

3,346 (100.0%)

(1) Columns and rows percentages

Or we can use the gtsummary package to summarize everything.

library(gtsummary)

Attaching package: 'gtsummary'
The following objects are masked from 'package:flextable':

    as_flextable, continuous_summary
select(nhanes,dental.caries, age.cat, bmi.cat, hba1c.cat, birthplace, diabetes) %>%
  tbl_summary(by = dental.caries)
Characteristic FALSE, N = 1,3361 TRUE, N = 2,0101
age.cat

    13-15 731 (55%) 940 (47%)
    16-18 605 (45%) 1,070 (53%)
bmi.cat

    Normal 886 (67%) 1,277 (64%)
    Obese 207 (16%) 333 (17%)
    Overweight 237 (18%) 381 (19%)
    Unknown 6 19
hba1c.cat

    <5.7% 1,129 (93%) 1,718 (93%)
    >=5.7% and <6.5% 81 (6.7%) 120 (6.5%)
    Unknown 126 172
birthplace

    Outside the US 155 (12%) 251 (12%)
    Within the US 1,180 (88%) 1,758 (88%)
    Unknown 1 1
diabetes

    diabetic 7 (0.6%) 13 (0.7%)
    nondiabetic 1,031 (85%) 1,566 (85%)
    prediabetic 173 (14%) 262 (14%)
    Unknown 125 169
1 n (%)

Summary statistics or visualization?

Summary statistics can compress datasets of any size into just a few numbers. This can be vital for understanding large amounts of data, but can also lead to misconceptions. Let’s take a look at a classic example, Anscombe’s quartet.

library(datasets)
datasets::anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

First, we need to restructure this data so we can easily group it by each of the different 4 datasets included. This uses the pivot_longer command to reshape the data, which we will not be covering in-class but is mentioned in the readings.

# Let's restructure this data to have 3 variables, x, y, and the dataset number it belongs to
long_ans <- anscombe |>
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

Now we can visualize the 4 datasets easily using a facet_wrap in ggplot:

ggplot(long_ans,aes(x=x,y=y,group=set))+geom_point()+facet_wrap(~set)

Exercise

Calculate the mean and variance for x and y, and the correlation between x and y of each of the 4 sets of data.

long_ans |>
  group_by(set) |>
  summarise(mean_x = mean(x), mean_y = mean(y), variance_x = var(x), variance_y = var(y), correlation = cor(x, y))
# A tibble: 4 × 6
  set   mean_x mean_y variance_x variance_y correlation
  <chr>  <dbl>  <dbl>      <dbl>      <dbl>       <dbl>
1 1          9   7.50         11       4.13       0.816
2 2          9   7.50         11       4.13       0.816
3 3          9   7.5          11       4.12       0.816
4 4          9   7.50         11       4.12       0.817

Datasaurus

A similar example is in the datasauRus package in R.

library(datasauRus)

datasaurus_dozen |> 
  group_by(dataset) |> 
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
# A tibble: 13 × 6
   dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
   <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
 1 away         54.3   47.8      16.8      26.9  -0.0641
 2 bullseye     54.3   47.8      16.8      26.9  -0.0686
 3 circle       54.3   47.8      16.8      26.9  -0.0683
 4 dino         54.3   47.8      16.8      26.9  -0.0645
 5 dots         54.3   47.8      16.8      26.9  -0.0603
 6 h_lines      54.3   47.8      16.8      26.9  -0.0617
 7 high_lines   54.3   47.8      16.8      26.9  -0.0685
 8 slant_down   54.3   47.8      16.8      26.9  -0.0690
 9 slant_up     54.3   47.8      16.8      26.9  -0.0686
10 star         54.3   47.8      16.8      26.9  -0.0630
11 v_lines      54.3   47.8      16.8      26.9  -0.0694
12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
13 x_shape      54.3   47.8      16.8      26.9  -0.0656
library(ggplot2)
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset))+
  geom_point()+
  theme_void()+
  theme(legend.position = "none")+
  facet_wrap(~dataset, ncol = 3)

Performing simple tests

Let’s start by looking at the relationship between ethnicity and dental caries using the table command.

table(nhanes$RIDRETH1, nhanes$dental.caries)
                                
                                 FALSE TRUE
  Mexican American                 306  610
  Non-Hispanic Black               391  574
  Non-Hispanic White               461  563
  Other Hispanic                   114  156
  Other Race - Including Multi-R    64  107

We can group and summarize the data to take a look at percents by group instead:

eth_ado <- nhanes %>%
  group_by(RIDRETH1, dental.caries) %>%
  summarise(count = n()) %>%
  mutate(perc = count/sum(count))
`summarise()` has grouped output by 'RIDRETH1'. You can override using the
`.groups` argument.
eth_ado
# A tibble: 10 × 4
# Groups:   RIDRETH1 [5]
   RIDRETH1                       dental.caries count  perc
   <chr>                          <lgl>         <int> <dbl>
 1 Mexican American               FALSE           306 0.334
 2 Mexican American               TRUE            610 0.666
 3 Non-Hispanic Black             FALSE           391 0.405
 4 Non-Hispanic Black             TRUE            574 0.595
 5 Non-Hispanic White             FALSE           461 0.450
 6 Non-Hispanic White             TRUE            563 0.550
 7 Other Hispanic                 FALSE           114 0.422
 8 Other Hispanic                 TRUE            156 0.578
 9 Other Race - Including Multi-R FALSE            64 0.374
10 Other Race - Including Multi-R TRUE            107 0.626

We could also visualize the different distributions:

ggplot(filter(eth_ado, dental.caries == TRUE), aes(x = RIDRETH1, y = perc*100)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    labs(x = "Ethnicity", y = "Percent Dental Caries")

We can also perform a Chi-squared test to examine the relationship between the variables. We either print the result directly or store it in a variable and examine it’s different components.

res <- chisq.test(table(nhanes$RIDRETH1, nhanes$dental.caries))

res$method
[1] "Pearson's Chi-squared test"
res$p.value
[1] 9.922401e-06
res$statistic
X-squared 
 28.48993 

Calculating statistics by group

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

nhanes %>%
  group_by(birthplace)
# A tibble: 3,346 × 32
# Groups:   birthplace [3]
    SEQN RIDAGEYR RIAGENDR RIDRETH1   DMDBORN INDFMPIR SDMVPSU SDMVSTRA WTINT2YR
   <dbl>    <dbl> <chr>    <chr>      <chr>      <dbl>   <dbl>    <dbl>    <dbl>
 1 31129       15 Male     Non-Hispa… Born i…     2.71       1       51    5317.
 2 31133       16 Female   Non-Hispa… Born i…     5          1       51    5635.
 3 31139       18 Female   Other His… Born i…     4.91       1       44    5676.
 4 31141       16 Male     Mexican A… Born i…     5          2       48    6332.
 5 31148       16 Female   Non-Hispa… Born i…     1.06       1       50   20536.
 6 31163       17 Male     Non-Hispa… Born i…     2.23       1       56    6425.
 7 31166       15 Male     Non-Hispa… Born i…     0.75       1       55   30180.
 8 31170       14 Male     Other Rac… Born i…     3          2       46   34800.
 9 31177       18 Female   Non-Hispa… Born i…     1.12       2       49    8767.
10 31178       14 Female   Mexican A… Born i…     4.07       1       57    6190.
# ℹ 3,336 more rows
# ℹ 23 more variables: WTMEC2YR <dbl>, OHXDECAY <lgl>, OHXREST <lgl>,
#   LBXGLU <dbl>, WTSAF2YR <dbl>, LBXGH <dbl>, BMXBMI <dbl>, Begin.Year <dbl>,
#   EndYear <dbl>, age.cat <chr>, plasma.glucose.cat <chr>, hba1c.cat <chr>,
#   bmi.cat <chr>, family.PIR.cat <chr>, birthplace <chr>, dental.caries <lgl>,
#   diabetes <chr>, DIQ010 <chr>, DID040 <dbl>, DIQ160 <chr>, DIQ050 <chr>,
#   diabetes.DIQ <chr>, in.study <lgl>

The group_by() function doesn’t perform any data processing, it groups the data into subsets: in the example above, our initial tibble of 3346 observations is split into 3 groups based on the birthplace variable.

We could similarly decide to group the tibble by sex:

nhanes %>%
  group_by(RIAGENDR)
# A tibble: 3,346 × 32
# Groups:   RIAGENDR [2]
    SEQN RIDAGEYR RIAGENDR RIDRETH1   DMDBORN INDFMPIR SDMVPSU SDMVSTRA WTINT2YR
   <dbl>    <dbl> <chr>    <chr>      <chr>      <dbl>   <dbl>    <dbl>    <dbl>
 1 31129       15 Male     Non-Hispa… Born i…     2.71       1       51    5317.
 2 31133       16 Female   Non-Hispa… Born i…     5          1       51    5635.
 3 31139       18 Female   Other His… Born i…     4.91       1       44    5676.
 4 31141       16 Male     Mexican A… Born i…     5          2       48    6332.
 5 31148       16 Female   Non-Hispa… Born i…     1.06       1       50   20536.
 6 31163       17 Male     Non-Hispa… Born i…     2.23       1       56    6425.
 7 31166       15 Male     Non-Hispa… Born i…     0.75       1       55   30180.
 8 31170       14 Male     Other Rac… Born i…     3          2       46   34800.
 9 31177       18 Female   Non-Hispa… Born i…     1.12       2       49    8767.
10 31178       14 Female   Mexican A… Born i…     4.07       1       57    6190.
# ℹ 3,336 more rows
# ℹ 23 more variables: WTMEC2YR <dbl>, OHXDECAY <lgl>, OHXREST <lgl>,
#   LBXGLU <dbl>, WTSAF2YR <dbl>, LBXGH <dbl>, BMXBMI <dbl>, Begin.Year <dbl>,
#   EndYear <dbl>, age.cat <chr>, plasma.glucose.cat <chr>, hba1c.cat <chr>,
#   bmi.cat <chr>, family.PIR.cat <chr>, birthplace <chr>, dental.caries <lgl>,
#   diabetes <chr>, DIQ010 <chr>, DID040 <dbl>, DIQ160 <chr>, DIQ050 <chr>,
#   diabetes.DIQ <chr>, in.study <lgl>

Here our initial tibble of 3346 observations is split into 0 groups based on the sex variable.

Once the data has been grouped, subsequent operations will be applied on each group independently.

The summarise() function

group_by() is often used together with summarise(), which collapses each group into a single-row summary of that group.

group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean bmi by birthplace:

nhanes %>%
  group_by(birthplace) %>%
  summarise(mean_bmi = mean(BMXBMI, na.rm = TRUE))
# A tibble: 3 × 2
  birthplace     mean_bmi
  <chr>             <dbl>
1 Outside the US     23.7
2 Within the US      24.3
3 <NA>               23.4

But we can can also group by multiple columns:

nhanes %>%
  group_by(BMXBMI, RIDRETH1) %>%
  summarise(mean_bmi = mean(BMXBMI, na.rm = TRUE))
`summarise()` has grouped output by 'BMXBMI'. You can override using the
`.groups` argument.
# A tibble: 2,616 × 3
# Groups:   BMXBMI [1,590]
   BMXBMI RIDRETH1                       mean_bmi
    <dbl> <chr>                             <dbl>
 1   13.3 Other Race - Including Multi-R     13.3
 2   13.4 Non-Hispanic Black                 13.4
 3   14.1 Non-Hispanic White                 14.1
 4   14.1 Mexican American                   14.1
 5   14.4 Other Hispanic                     14.4
 6   14.4 Non-Hispanic White                 14.4
 7   14.4 Other Race - Including Multi-R     14.4
 8   14.6 Non-Hispanic White                 14.6
 9   14.6 Non-Hispanic White                 14.6
10   14.6 Mexican American                   14.6
# ℹ 2,606 more rows

Once the data is grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add columns indicating the mean and median plasma glucose (LBXGLU) by sex and ethnicity:

nhanes %>%
  group_by(RIDAGEYR, age.cat, RIDRETH1) %>%
  summarise(mean_plasma_glucose = mean(LBXGLU, na.rm = TRUE),
            median_plasma_glucose = median(LBXGLU, na.rm = TRUE))
`summarise()` has grouped output by 'RIDAGEYR', 'age.cat'. You can override
using the `.groups` argument.
# A tibble: 30 × 5
# Groups:   RIDAGEYR, age.cat [6]
   RIDAGEYR age.cat RIDRETH1           mean_plasma_glucose median_plasma_glucose
      <dbl> <chr>   <chr>                            <dbl>                 <dbl>
 1       13 13-15   Mexican American                  98.5                  95  
 2       13 13-15   Non-Hispanic Black                93.4                  93.5
 3       13 13-15   Non-Hispanic White                94.6                  94  
 4       13 13-15   Other Hispanic                    97.5                  99  
 5       13 13-15   Other Race - Incl…                95.8                  94  
 6       14 13-15   Mexican American                  97.5                  94  
 7       14 13-15   Non-Hispanic Black                91.9                  91  
 8       14 13-15   Non-Hispanic White                94.4                  94  
 9       14 13-15   Other Hispanic                    96.0                  94  
10       14 13-15   Other Race - Incl…                95.8                  97  
# ℹ 20 more rows
Challenge

Calculate the mean hba1c of participants born in the USA by age.

nhanes %>%
  filter(birthplace == 'Within the US') %>%
  group_by(age.cat) %>%
  summarise(mean_hba1c = mean(LBXGH, na.rm = TRUE))
# A tibble: 2 × 2
  age.cat mean_hba1c
  <chr>        <dbl>
1 13-15         5.21
2 16-18         5.16

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each age, we would do:

nhanes %>%
    count(RIDAGEYR)
# A tibble: 6 × 2
  RIDAGEYR     n
     <dbl> <int>
1       13   537
2       14   576
3       15   558
4       16   590
5       17   558
6       18   527

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, nhanes %>% count(age.years) is equivalent to:

nhanes %>%
    group_by(age.cat) %>%
    summarise(n = n())
# A tibble: 2 × 2
  age.cat     n
  <chr>   <int>
1 13-15    1671
2 16-18    1675

The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., infection). If we wanted to count a combination of factors, such as age and sex, we would specify the first and the second factor as the arguments of count():

nhanes %>%
    count(RIDAGEYR, RIAGENDR)
# A tibble: 12 × 3
   RIDAGEYR RIAGENDR     n
      <dbl> <chr>    <int>
 1       13 Female     268
 2       13 Male       269
 3       14 Female     290
 4       14 Male       286
 5       15 Female     271
 6       15 Male       287
 7       16 Female     288
 8       16 Male       302
 9       17 Female     263
10       17 Male       295
11       18 Female     263
12       18 Male       264
Challenge
  1. How many participants have a non-empty plasma glucose value for each age (using RIDAGEYR)?
  2. Use group_by() and summarise() to evaluate BMI (BMXBMI) by ethnicity. Which ethnicity has the highest mean BMI?
#1
nhanes %>% filter(!is.na(LBXGLU)) %>%
  group_by(RIDAGEYR) %>%
  count()
# A tibble: 6 × 2
# Groups:   RIDAGEYR [6]
  RIDAGEYR     n
     <dbl> <int>
1       13   227
2       14   232
3       15   259
4       16   258
5       17   240
6       18   241
#2
nhanes %>% group_by(RIDRETH1) %>%
  summarise(mean_BMI = mean(BMXBMI, na.rm = TRUE)) %>%
  arrange(desc(mean_BMI))
# A tibble: 5 × 2
  RIDRETH1                       mean_BMI
  <chr>                             <dbl>
1 Non-Hispanic Black                 24.9
2 Mexican American                   24.7
3 Other Hispanic                     24.2
4 Other Race - Including Multi-R     23.4
5 Non-Hispanic White                 23.3

Analyzing the adolescent data

As we just saw, we can use group_by and summarize to see summary statistics by various subgroups. For instance, let’s take a look at dental caries and birthplace.

nhanes |>
  group_by(dental.caries, birthplace) |>
  summarize(n = n())
`summarise()` has grouped output by 'dental.caries'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   dental.caries [2]
  dental.caries birthplace         n
  <lgl>         <chr>          <int>
1 FALSE         Outside the US   155
2 FALSE         Within the US   1180
3 FALSE         <NA>               1
4 TRUE          Outside the US   251
5 TRUE          Within the US   1758
6 TRUE          <NA>               1
Summarizing

Calculate the percent of adolescents with dental caries for each birthplace category.

nhanes |>
  group_by(birthplace) |> # Group by birthplace
  summarize(n = n(), caries = sum(dental.caries)) |>
  mutate(perc_caries = caries/n)
# A tibble: 3 × 4
  birthplace         n caries perc_caries
  <chr>          <int>  <int>       <dbl>
1 Outside the US   406    251       0.618
2 Within the US   2938   1758       0.598
3 <NA>               2      1       0.5  

Insulin

Let’s look at the demographics of who’s taking insulin by birthplace and family income.

nhanes |>
  filter(diabetes == "diabetic") |>
  group_by(birthplace, family.PIR.cat) |> 
  summarize(n = n(), insulin = sum((DIQ050) == "Yes")) |>
  mutate(perc_insulin = insulin/n)
`summarise()` has grouped output by 'birthplace'. You can override using the
`.groups` argument.
# A tibble: 5 × 5
# Groups:   birthplace [2]
  birthplace     family.PIR.cat     n insulin perc_insulin
  <chr>          <chr>          <int>   <int>        <dbl>
1 Outside the US <1                 2       1        0.5  
2 Outside the US >=1                1       0        0    
3 Within the US  <1                 6       2        0.333
4 Within the US  >=1               10       2        0.2  
5 Within the US  <NA>               1       1        1    

We can also take a look at ethnicity. Let’s split this into two steps. First we make a table containing summarized data.

insulin_tbl <- nhanes |>
  filter(diabetes == "diabetic") |>
  group_by(RIDRETH1, birthplace) |> 
  summarize(n = n(), insulin = sum((DIQ050) == "Yes"), na.rm=TRUE) |>
  mutate(perc_insulin = insulin/n)
`summarise()` has grouped output by 'RIDRETH1'. You can override using the
`.groups` argument.

And now we can plot it.

ggplot(insulin_tbl, aes(x = birthplace, y = perc_insulin)) + 
  geom_bar(stat = "identity") + 
  facet_grid(~RIDRETH1)

Unfortunately, it turns out insulin data is quite sparse for adolescents in the dataset. Thus, we can’t really get a good picture of insulin use within our study population. This is probably why this data was not included in the paper’s analysis.