Week 2: Markdown and introduction

Introduction

Welcome! Each week in class, we will answer questions on the reading and perform analyses based on a code notebook.

Throughout these sessions we will be replicating the analysis from Beheshti et al. 2021.¹

This is an analysis of the association between diabetes and dental caries in U.S. adolescents. By the end of the semester we will be able to replicate and extend the analysis in this paper.

Code Notebooks

A code notebook is a document that typically consists of different chunks. Each chunk is either code or text. There are a variety of notebook platforms for different languages, such as Jupyter notebooks in Python. In R, notebooks have historically been written using R Markdown. More recently, Posit (the organization behind RStudio) created Quarto as an updated version of R Markdown.

R Markdown/Quarto notebooks can be rendered to different formats such as HTML (a webpage viewable in your web browser), PDF, Word, PowerPoint, and others. Their power lies in their ability to turn code into an output document: we can write our report in the same document in which we perform the analysis, integrating the two.

Quarto and R Markdown syntax are almost identical. We will mainly be using Quarto in this course.

Code Chunks

You can start and end a code chunk using three back ticks “```”. To have a chunk run as R code, you need to mark the chunk with {r} right after the opening back ticks. You can then specify options for the chunk on subsequent lines using the “hash-pipe” #|. Code chunks have a lot of options, but some of the most important are label, eval, echo, and output.

x <- 5
x

[1] 5
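As a rough sketch of how these options look in practice, the body of the chunk above might be written as follows in the .qmd source. In the actual document these lines sit between an opening fence (three back ticks followed by {r}) and a closing fence of three back ticks. The label `assign-x` is a hypothetical name chosen for illustration, and the option values shown are assumed to match the usual defaults.

```r
#| label: assign-x
#| eval: true
#| echo: true
#| output: true

# "assign-x" is a hypothetical label used only for this illustration.
# eval: run the code; echo: show the code; output: show the printed result.
x <- 5
x
```

Because the hash-pipe lines start with #, R itself treats them as comments, so the chunk body also runs unchanged in the console.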

Exercise

Try changing these options in the first of the two chunks below and re-rendering the document. What does each of these options do? Pay attention to the output of both chunks.

y = 8
x

[1] 5

[1] 8

x <- x + y

x # Show the value of x

[1] 13

Markdown

Markdown is a language used to quickly create formatted text. It’s great to know as it is used in R Markdown, Quarto, Jupyter, GitHub documents, and many other places. A pure markdown file has a .md file extension.

You can find a quick guide to markdown here. Throughout the course we will see various things markdown can do in the readings and in-class materials.

Quarto vs. R Markdown

For those familiar with R Markdown, you can find a rundown of changes here.

Because Quarto was written as an evolution of R Markdown, it also supports most R Markdown syntax. While we could technically mix and match different types of syntax in a single document, this is bad practice. Readable code is consistent. Even if there are multiple ways to do something, it’s best to choose one way and stick with it throughout a script or document. For an example of how passionate programmers can get about consistency in their code, check out the Wikipedia article on indentation style.

Data

Let’s dive into the dataset we will be using today.

We will begin with a mostly processed dataset from NHANES as described in Beheshti et al. 2021. Specifically, this dataset contains the 3346 adolescents recorded in NHANES from 2005 to 2010 with non-missing dental decay data.

You can find the dataset here in the session 2 materials folder. Download it and place it into an appropriate location in your project folder.

We will use the read_csv function for reading in this .csv file.

library(tidyverse)

nhanes_ado <- read_csv("week2Data.csv")

1: This line loads the tidyverse library, which is actually a family of different packages. Note that this code will fail if you do not have tidyverse installed; you can install it with install.packages("tidyverse").
2: You could change the filename to file.choose() to manually select a file location.
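Here is a minimal, self-contained sketch of read_csv that uses a small temporary file instead of the course dataset. The file contents and the names `tmp` and `toy` are made up for illustration.

```r
library(readr)  # part of the tidyverse; read_csv lives here

# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age.years,gender", "1,13,Female", "2,17,Male"), tmp)

# read_csv returns a tibble and guesses each column's type from the data.
toy <- read_csv(tmp)
toy

# Check the number of rows and columns before moving on.
dim(toy)
```

Checking dim() right after reading a file is a quick way to confirm that the whole file was read.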

Let’s take a look at the first 200 rows of the dataframe. Note that you will need to have the DT package installed (install.packages('DT')).

Summarizing data

There are a wide variety of ways to examine our data. Let’s start by using the summary function to get an overview of nhanes_ado.

summary(nhanes_ado)

1: While summary gives us a lot of information, we can’t easily extract or reuse each piece of it. Thus, we also need to be able to calculate the values shown individually.

      ...1         sequence.id      age.years        gender         
 Min.   :   1.0   Min.   :31129   Min.   :13.00   Length:3346       
 1st Qu.: 837.2   1st Qu.:36530   1st Qu.:14.00   Class :character  
 Median :1673.5   Median :43087   Median :16.00   Mode  :character  
 Mean   :1673.5   Mean   :44692   Mean   :15.49                     
 3rd Qu.:2509.8   3rd Qu.:53015   3rd Qu.:17.00                     
 Max.   :3346.0   Max.   :62147   Max.   :18.00                     
                                                                    
  ethnicity          birthplace         family.PIR        dental.decay.present
 Length:3346        Length:3346        Length:3346        Mode :logical       
 Class :character   Class :character   Class :character   FALSE:2808          
 Mode  :character   Mode  :character   Mode  :character   TRUE :538           
                                                                              
                                                                              
                                                                              
                                                                              
 dental.restoration.present plasma.glucose       hba1c             bmi       
 Mode :logical              Min.   : 54.00   Min.   : 4.000   Min.   :13.30  
 FALSE:1632                 1st Qu.: 89.00   1st Qu.: 5.000   1st Qu.:19.91  
 TRUE :1714                 Median : 94.00   Median : 5.200   Median :22.58  
                            Mean   : 94.42   Mean   : 5.188   Mean   :24.23  
                            3rd Qu.: 98.00   3rd Qu.: 5.400   3rd Qu.:27.09  
                            Max.   :527.00   Max.   :14.100   Max.   :62.08  
                            NA's   :1889     NA's   :298      NA's   :25     
   age.cat           hba1c.cat         plasma.glucose.cat family.PIR.lt1 
 Length:3346        Length:3346        Length:3346        Mode :logical  
 Class :character   Class :character   Class :character   FALSE:2209     
 Mode  :character   Mode  :character   Mode  :character   TRUE :914      
                                                          NA's :223      
                                                                         
                                                                         
                                                                         
   diabetes         family.PIR.cat     dental.caries  
 Length:3346        Length:3346        Mode :logical  
 Class :character   Class :character   FALSE:1336     
 Mode  :character   Mode  :character   TRUE :2010
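The individual values that summary reports for a numeric column can each be computed with a base R function. The sketch below uses a short made-up vector (`glucose` is a hypothetical name, not the course data).

```r
glucose <- c(89, 94, 98, 101)  # made-up values for illustration

min(glucose)
quantile(glucose, probs = c(0.25, 0.75))  # 1st and 3rd quartiles
median(glucose)
mean(glucose)
max(glucose)

# Unlike the printed summary, a single value can be stored and reused.
glucose_median <- median(glucose)
glucose_median
```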

Recall from the reading that we can use filter to get a subset of rows and select to get a subset of columns when we’re using the tidyverse. We can also use the $ or [[]] notations to get columns by name or the [] notation to get rows and columns by index.
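As a quick refresher, here is a sketch of each of these approaches on a small made-up data frame. The name `toy` and its values are assumptions for illustration only.

```r
library(dplyr)

# Made-up data frame for illustration only.
toy <- data.frame(
  id = 1:4,
  age.years = c(13, 15, 17, 18),
  gender = c("Female", "Male", "Female", "Male")
)

# Tidyverse: filter() keeps rows, select() keeps columns.
older <- filter(toy, age.years >= 17)
ages_tbl <- select(toy, age.years)

# Base R: $ and [[ ]] both return the column as a plain vector.
ages_vec <- toy$age.years
same_vec <- toy[["age.years"]]

# Base R: [rows, columns] by position; here rows 1-2 of columns 2-3.
corner <- toy[1:2, 2:3]
```

Note that select returns a data frame with one column, while $ and [[ ]] return the values themselves.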

Exercise

Below is an incomplete set of code blocks summarizing the nhanes_ado dataset. Try to fill in the missing parts of the blocks. You need to add or modify code wherever you see TODO.

We’ve already looked at the first few rows of the dataset. Now let’s check the last 15 rows using the tail function. You can type ?tail into the R console to check how tail’s arguments work.

tail(nhanes_ado)

# A tibble: 6 × 19
   ...1 sequence.id age.years gender ethnicity             birthplace family.PIR
  <dbl>       <dbl>     <dbl> <chr>  <chr>                 <chr>      <chr>     
1  3341       61184        17 Female Mexican American      Born in 5… 1.69      
2  3342       61409        14 Male   Mexican American      Born in O… Value gre…
3  3343       61410        17 Female Non-Hispanic Black    Born in 5… 4.54      
4  3344       61467        15 Male   Non-Hispanic White    Born in 5… 0.64      
5  3345       61473        13 Female Other Race - Includi… Born in 5… 1.65      
6  3346       61592        18 Female Mexican American      Born in O… 0.84      
# ℹ 12 more variables: dental.decay.present <lgl>,
#   dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
#   bmi <dbl>, age.cat <chr>, hba1c.cat <chr>, plasma.glucose.cat <chr>,
#   family.PIR.lt1 <lgl>, diabetes <chr>, family.PIR.cat <chr>,
#   dental.caries <lgl>
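Both head and tail take an n argument giving the number of rows to return, which is 6 when it is left out. A small sketch on a made-up data frame (`toy` is hypothetical):

```r
toy <- data.frame(id = 1:20, value = (1:20)^2)  # made-up data

tail(toy)                       # last 6 rows (the default)
last_three <- tail(toy, n = 3)  # last 3 rows only
last_three
```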

We can display a table showing the min, max, and mean for age and BMI. Replace the 0’s below with the R expressions to correctly fill in the table.

min_age <- min(select(nhanes_ado, age.years)) # Tidyverse way
max_age <- max(nhanes_ado$age.years) # base R way
mean_age <- mean(nhanes_ado$age.years) 
min_bmi <- min(nhanes_ado$bmi, na.rm = TRUE) # BMI has some NA's so we need na.rm
max_bmi <- max(nhanes_ado$bmi, na.rm = TRUE)
mean_bmi <- mean(nhanes_ado$bmi, na.rm = TRUE)
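To see why na.rm matters, here is a short sketch with made-up values: a single missing value makes the whole result missing unless it is removed first.

```r
bmi <- c(19.9, NA, 22.6, 27.1)  # made-up values with one missing entry

mean(bmi)                            # NA: the missing value propagates
bmi_mean <- mean(bmi, na.rm = TRUE)  # drop the NA before averaging
bmi_mean
```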

This is a basic markdown table. We will see more advanced ways to make tables later (and we already saw one above using the DT package). Here we are using inline code to put variables into the markdown text.

Variable	Minimum	Maximum	Mean
Age	13	18	15.4892409
BMI	13.3	62.08	24.2348871
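Inline code is written as a back tick, the letter r, a space, an R expression, and a closing back tick. Values inserted this way are shown with many digits unless they are rounded first, so one option is to prepare a rounded version beforehand. A sketch with made-up ages (`mean_age_text` is a hypothetical name):

```r
ages <- c(13, 14, 16, 17, 18, 15, 16)  # made-up ages
mean_age <- mean(ages)

# Round once, then reuse the formatted text wherever the value is shown.
mean_age_text <- format(round(mean_age, 2), nsmall = 2)
mean_age_text
```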

Practice with Conditionals

While the summarizing above is nice, we need more tools to ask more interesting questions about the data. We can use conditional statements to dive into the data a bit deeper.

In the tidyverse we can use dplyr’s filter with conditional statements to see how many rows meet various criteria. We can also sum a conditional statement directly.

filter(nhanes_ado, gender=="Female") %>% nrow() # Tidyverse way

[1] 1643

sum(nhanes_ado$gender == "Female") # Base R way

[1] 1643
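Summing a condition works because R counts TRUE as 1 and FALSE as 0. The sketch below uses made-up values and also shows what happens when a value is missing.

```r
gender <- c("Female", "Male", "Female", NA)  # made-up values

is_female <- gender == "Female"  # TRUE FALSE TRUE NA
sum(is_female)                   # NA, because one comparison is NA

# na.rm = TRUE ignores the missing comparison.
n_female <- sum(is_female, na.rm = TRUE)
n_female

# The mean of a logical vector is the proportion of TRUE values.
prop_female <- mean(is_female, na.rm = TRUE)
prop_female
```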

Let’s try it now.

Exercise

Fill in the TODOs below with the correct expressions.

under_15 <- nrow(filter(nhanes_ado, age.years < 15))
underweight <- nrow(filter(nhanes_ado, bmi < 18.5))
overweight <- nrow(filter(nhanes_ado, bmi >= 25 & bmi <= 29.9))
decay_or_restore <- nrow(filter(nhanes_ado, dental.decay.present | dental.restoration.present))
med_glu <- median(nhanes_ado$plasma.glucose, na.rm = TRUE)
white_and_medpg <- nrow(filter(nhanes_ado, 
                               ethnicity == "Non-Hispanic White" & plasma.glucose > med_glu))

This is an example of a markdown list.

- There are 1113 samples with an age under 15.
- There are 412 samples who are underweight (BMI below 18.5).
- There are 613 samples who are overweight (BMI between 25 and 29.9).
- There are 2010 samples who have either dental decay or a dental restoration (dental caries experience).
- There are 200 samples who are Non-Hispanic White and whose plasma glucose is greater than the median.
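The counts above combine conditions with & (both must be true) and | (at least one must be true). Here is a sketch on a made-up data frame, which also shows that filter drops rows where the condition is NA, while sum needs na.rm = TRUE to give the same count. The column names are hypothetical.

```r
library(dplyr)

# Made-up data frame for illustration only.
toy <- data.frame(
  bmi = c(17.0, 26.5, NA, 31.2),
  decay = c(TRUE, FALSE, FALSE, TRUE),
  restoration = c(FALSE, FALSE, TRUE, TRUE)
)

# & requires both conditions; | requires at least one.
n_both <- nrow(filter(toy, decay & restoration))
n_either <- nrow(filter(toy, decay | restoration))

# filter() drops the row where bmi is NA ...
n_over25_tidy <- nrow(filter(toy, bmi >= 25))
# ... so the base R count needs na.rm = TRUE to match.
n_over25_base <- sum(toy$bmi >= 25, na.rm = TRUE)
```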

Footnotes

1. Beheshti, Mahdieh, et al. “Association of Diabetes and Dental Caries Among U.S. Adolescents in the NHANES Dataset.” Pediatric Dentistry, vol. 43, no. 2 (2021): 123-128.