x <- 5
x
[1] 5
Welcome! Each week in class, we will answer questions on the reading and perform analyses based on a code notebook.
Throughout these sessions we will be replicating the analysis from Beheshti et al. 2021 [1].
This is an analysis of the association between diabetes and dental caries in U.S. adolescents. By the end of the semester we will be able to replicate and extend the analysis in this paper.
A code notebook is a document which typically consists of different chunks. Each chunk is either code or text. There are a variety of different notebook platforms for different languages, such as Jupyter notebooks in Python. In R, notebooks have historically been written using R Markdown. However, recently Quarto has been created by Posit (the organization behind RStudio) as an updated version of R Markdown.
R Markdown/Quarto notebooks can be rendered to different formats such as HTML (a webpage viewable in your web browser), PDF, Word, PowerPoint, and others. Their power lies in their ability to turn code into an output document. We can write our report in the same document where we actually perform the analysis, integrating the two together.
Quarto and R Markdown syntax are almost identical. We will mainly be using Quarto in this course.
You can start and end a code chunk using three backticks (“```”). To have a chunk run as R code, you need to mark the opening backticks with {r}. You can then specify options for the chunk on subsequent lines using the “hash-pipe” #|. Code chunks have a lot of options, but some of the most important are label, eval, echo, and output.
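As a minimal sketch of the options named above, the body of a chunk might look like the following. In a .qmd file these lines would sit between an opening fence marked {r} and a closing fence. The label demo-chunk and the option values are illustrative assumptions, not taken from the course notebook.

```r
#| label: demo-chunk
#| eval: true
#| echo: true
#| output: true

# Everything below the option lines is ordinary R code.
# To R itself, the "#|" lines are just comments; Quarto reads them as options.
x <- 5
x
```

Setting eval to false would show the code without running it, and setting echo to false would run the code but hide it from the rendered document.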
Markdown is a language used to quickly create formatted text. It’s great to know, as it is used in R Markdown, Quarto, Jupyter, GitHub documents, and many other places. A pure markdown file has a .md file extension.
You can find a quick guide to markdown here. Throughout the course, we will see various things markdown can do in the readings and in-class materials.
For those familiar with R Markdown, you can find a rundown of changes here.
Because Quarto was written as an evolution of R Markdown, it also supports most R Markdown syntax. While we could technically mix and match different types of syntax in a single document, this is bad practice. Readable code is consistent. Even if there are multiple ways to do something, it’s best to choose one way and stick with it throughout a script or document. For an example of how passionate programmers can get about consistency in their code, check out the Wikipedia article on indentation style.
Let’s dive into the dataset we will be using today.
We will begin with a mostly processed dataset from NHANES as described in Beheshti 2021. Specifically, this dataset contains the 3346 adolescents recorded in NHANES from 2005 to 2010 with non-missing dental decay data.
You can find the dataset here in the session 2 materials folder. Download it and place it into an appropriate location in your project folder.
We will use the read_csv function for reading in this .csv file.
The read_csv function comes from the tidyverse library, which is actually a family of different packages. Note that this code will fail if you do not have tidyverse installed; you can install it with install.packages("tidyverse").
You can also use file.choose() to manually select a file location.
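The reading step can be sketched as follows. Because the actual file name and location depend on your project folder, this example writes a tiny stand-in CSV to a temporary file first; the names demo_path and demo_df and the three rows of data are hypothetical, not part of the NHANES dataset.

```r
library(readr)  # read_csv lives in readr, which library(tidyverse) also loads

# Hypothetical stand-in for the course dataset: a tiny CSV in a temp file.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("sequence.id,age.years,bmi",
    "31129,15,22.4",
    "31130,17,NA"),
  demo_path
)

# read_csv returns a tibble; the string "NA" is read as a missing value.
demo_df <- read_csv(demo_path, show_col_types = FALSE)
demo_df

# Interactive alternative (opens a file picker, so it is commented out here):
# demo_df <- read_csv(file.choose())
```

For the real data you would replace demo_path with the path to the downloaded file inside your project folder.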
Let’s take a look at the first 200 rows of the dataframe. Note that you will need to have the DT package installed (install.packages('DT')).
There are a wide variety of ways to examine our data. Let’s start by using the summary function to get an overview of nhanes_ado.
Although summary gives us a lot of information, we can’t easily extract or use each piece of it. Thus, we also need to be able to calculate the displayed values individually.
      ...1          sequence.id      age.years        gender
 Min.   :   1.0   Min.   :31129   Min.   :13.00   Length:3346
 1st Qu.: 837.2   1st Qu.:36530   1st Qu.:14.00   Class :character
 Median :1673.5   Median :43087   Median :16.00   Mode  :character
 Mean   :1673.5   Mean   :44692   Mean   :15.49
 3rd Qu.:2509.8   3rd Qu.:53015   3rd Qu.:17.00
 Max.   :3346.0   Max.   :62147   Max.   :18.00
  ethnicity          birthplace         family.PIR        dental.decay.present
 Length:3346        Length:3346        Length:3346        Mode :logical
 Class :character   Class :character   Class :character   FALSE:2808
 Mode  :character   Mode  :character   Mode  :character   TRUE :538
 dental.restoration.present plasma.glucose      hba1c             bmi
 Mode :logical              Min.   : 54.00   Min.   : 4.000   Min.   :13.30
 FALSE:1632                 1st Qu.: 89.00   1st Qu.: 5.000   1st Qu.:19.91
 TRUE :1714                 Median : 94.00   Median : 5.200   Median :22.58
                            Mean   : 94.42   Mean   : 5.188   Mean   :24.23
                            3rd Qu.: 98.00   3rd Qu.: 5.400   3rd Qu.:27.09
                            Max.   :527.00   Max.   :14.100   Max.   :62.08
                            NA's   :1889     NA's   :298      NA's   :25
   age.cat           hba1c.cat         plasma.glucose.cat family.PIR.lt1
 Length:3346        Length:3346        Length:3346        Mode :logical
 Class :character   Class :character   Class :character   FALSE:2209
 Mode  :character   Mode  :character   Mode  :character   TRUE :914
                                                          NA's :223
   diabetes         family.PIR.cat     dental.caries
 Length:3346        Length:3346        Mode :logical
 Class :character   Class :character   FALSE:1336
 Mode  :character   Mode  :character   TRUE :2010
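Each value that summary prints can also be computed on its own, which makes it available for later calculations. The sketch below uses a short hypothetical vector (bmi_demo) and a hypothetical logical vector (decay_demo) rather than the real nhanes_ado columns.

```r
# Hypothetical BMI-like values with one missing entry.
bmi_demo <- c(18.2, 22.5, NA, 27.1, 31.0)

summary(bmi_demo)  # the full overview, including the NA count

# The same quantities, one at a time; na.rm = TRUE skips the missing value.
min_demo    <- min(bmi_demo, na.rm = TRUE)
quart_demo  <- quantile(bmi_demo, probs = c(0.25, 0.75), na.rm = TRUE)
median_demo <- median(bmi_demo, na.rm = TRUE)
mean_demo   <- mean(bmi_demo, na.rm = TRUE)
max_demo    <- max(bmi_demo, na.rm = TRUE)
n_missing   <- sum(is.na(bmi_demo))  # corresponds to the "NA's" row

# For a logical column, table() reproduces the FALSE/TRUE counts.
decay_demo <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
decay_counts <- table(decay_demo)
decay_counts
```

Without na.rm = TRUE, functions such as min and mean return NA as soon as the input contains a missing value.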
Recall from the reading that we can use filter to get a subset of rows and select to get a subset of columns when we’re using the tidyverse. We can also use the $ or [[]] notations to get columns by name or the [] notation to get rows and columns by index.
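These subsetting approaches can be compared side by side. The following is a minimal sketch on a hypothetical four-row data frame (demo), not the course dataset; only the column names are borrowed from nhanes_ado.

```r
library(dplyr)

# Hypothetical toy data with column names borrowed from nhanes_ado.
demo <- data.frame(
  id        = 1:4,
  age.years = c(13, 15, 17, 18),
  gender    = c("Female", "Male", "Female", "Male")
)

# Tidyverse: filter keeps rows, select keeps columns.
older   <- filter(demo, age.years > 15)  # rows where the condition is TRUE
ages_df <- select(demo, age.years)       # still a data frame, one column

# Base R: columns by name.
by_dollar  <- demo$age.years       # a plain numeric vector
by_bracket <- demo[["age.years"]]  # the same vector

# Base R: rows and columns by index, written as [rows, columns].
second_row <- demo[2, ]            # row 2, all columns
second_col <- demo[, 2]            # column 2 (a vector for a plain data.frame;
                                   # a tibble would return a one-column tibble)
corner     <- demo[1:2, c(1, 3)]   # rows 1-2, columns 1 and 3
```

Note the difference in what comes back: select returns a data frame, while $ and [[]] return the column itself as a vector.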
Below is an incomplete set of code blocks summarizing the nhanes_ado dataset. Try to fill in the missing parts of the blocks. You need to add or modify code wherever you see TODO.
We’ve already looked at the first few rows of the dataset. Now let’s check the last 15 rows using the tail function. You can type ?tail into the R console to check how tail’s arguments work.
# A tibble: 6 × 19
   ...1 sequence.id age.years gender ethnicity             birthplace family.PIR
  <dbl>       <dbl>     <dbl> <chr>  <chr>                 <chr>      <chr>
1  3341       61184        17 Female Mexican American      Born in 5… 1.69
2  3342       61409        14 Male   Mexican American      Born in O… Value gre…
3  3343       61410        17 Female Non-Hispanic Black    Born in 5… 4.54
4  3344       61467        15 Male   Non-Hispanic White    Born in 5… 0.64
5  3345       61473        13 Female Other Race - Includi… Born in 5… 1.65
6  3346       61592        18 Female Mexican American      Born in O… 0.84
# ℹ 12 more variables: dental.decay.present <lgl>,
# dental.restoration.present <lgl>, plasma.glucose <dbl>, hba1c <dbl>,
# bmi <dbl>, age.cat <chr>, hba1c.cat <chr>, plasma.glucose.cat <chr>,
# family.PIR.lt1 <lgl>, diabetes <chr>, family.PIR.cat <chr>,
# dental.caries <lgl>
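The output above has six rows, which is what tail returns when no row count is given. As a small sketch of how the n argument changes this, the example below uses mtcars, a 32-row dataset that ships with R, as a stand-in for nhanes_ado.

```r
# mtcars ships with R and has 32 rows; it stands in for nhanes_ado here.
default_tail     <- tail(mtcars)           # n defaults to 6
last_15          <- tail(mtcars, n = 15)   # the last 15 rows
all_but_first_10 <- tail(mtcars, n = -10)  # negative n drops the first 10 rows

nrow(default_tail)
nrow(last_15)
nrow(all_but_first_10)
```

The head function takes the same n argument and works from the top of the data instead.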
We can display a table showing the min, max, and mean for age and BMI. Replace the 0’s below with the R expressions that correctly fill in the table.
min_age <- min(select(nhanes_ado, age.years)) # Tidyverse way
max_age <- max(nhanes_ado$age.years) # base R way
mean_age <- mean(nhanes_ado$age.years)
min_bmi <- min(nhanes_ado$bmi, na.rm = TRUE) # BMI has some NA's so we need na.rm
max_bmi <- max(nhanes_ado$bmi, na.rm = TRUE)
mean_bmi <- mean(nhanes_ado$bmi, na.rm = TRUE)
This is a basic markdown table. We will see more advanced ways to make tables later (and we already saw one above using the DT package). Here we are using inline code to put variables into the markdown text.
| Variable | Minimum | Maximum | Mean |
|---|---|---|---|
| Age | 13 | 18 | 15.4892409 |
| BMI | 13.3 | 62.08 | 24.2348871 |
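In a Quarto document, inline code is written as `r expression` inside the text, so a table cell can hold something like `r round(mean_age, 2)` to keep the mean readable. The sketch below shows the same idea in plain R by assembling one table row from variables; ages_demo and the other names are hypothetical, and building the row with paste is only an illustration of what inline code substitutes, not how the notebook does it.

```r
# Hypothetical ages, not the NHANES values.
ages_demo <- c(13, 15, 18, 16)

# Rounding keeps long decimals out of the table.
rounded_mean <- round(mean(ages_demo), 2)

# Assemble one markdown table row from the computed values.
age_row <- paste("| Age |", min(ages_demo), "|", max(ages_demo), "|",
                 rounded_mean, "|")
cat(age_row, "\n")
```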
While the summarizing above is nice, we need more tools to ask more interesting questions about the data. We can use conditional statements to dive into the data a bit deeper.
In the tidyverse, we can use dplyr’s filter with conditional statements to see how many rows meet various criteria. We can also sum a conditional statement directly.
[1] 1643
[1] 1643
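The two identical counts above come from two equivalent ways of counting rows. Since the code that produced them is not shown here, the following sketch uses a hypothetical six-row data frame (demo) to illustrate both approaches, including how each treats missing values.

```r
library(dplyr)

# Hypothetical toy data; one BMI value is missing.
demo <- data.frame(
  age.years = c(13, 14, 15, 16, 17, 18),
  bmi       = c(17.9, 21.0, NA, 26.3, 30.1, 24.9)
)

# Approach 1: keep the matching rows, then count them.
n_filter <- nrow(filter(demo, age.years < 16))

# Approach 2: a condition is a vector of TRUE/FALSE, and TRUE counts as 1.
n_sum <- sum(demo$age.years < 16)

# With missing values, filter silently drops rows where the condition is NA,
# while sum needs na.rm = TRUE to avoid returning NA.
n_bmi_filter <- nrow(filter(demo, bmi >= 25))
n_bmi_sum    <- sum(demo$bmi >= 25, na.rm = TRUE)

c(n_filter, n_sum, n_bmi_filter, n_bmi_sum)
```

Conditions can be combined with & (and) and | (or) in either approach.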
Let’s try it now.
Fill in the TODO’s below with the correct expressions.
under_15 <- nrow(filter(nhanes_ado, age.years < 15))
underweight <- nrow(filter(nhanes_ado, bmi < 18.5))
overweight <- nrow(filter(nhanes_ado, bmi >= 25 & bmi <= 29.9))
decay_or_restore <- nrow(filter(nhanes_ado, dental.decay.present | dental.restoration.present))
med_glu <- median(nhanes_ado$plasma.glucose, na.rm = TRUE)
white_and_medpg <- nrow(filter(nhanes_ado,
                               ethnicity == "Non-Hispanic White" & plasma.glucose > med_glu))
This is an example of a markdown list.
[1] Beheshti, Mahdieh et al. “Association of Diabetes and Dental Caries Among U.S. Adolescents in the NHANES Dataset.” Pediatric Dentistry vol. 43,2 (2021): 123-128.