Analysis Practice

Data analysis exercise 1: Due November 8

What is at least one challenge you expect to encounter when analyzing your project’s data? Examples of challenges include:

A high percentage of missing data.
Groups of interest are of uneven size.
A specialized type of data which is difficult to deal with (such as geographic or time-series data).
Unreliable data (values may be incorrect).
Data has to be combined from many different sources.
How the data was collected or recorded changed over time.

Tip: If you are unsure what challenges your project’s data might face, take a look at other papers which use the same or similar data. Do the authors mention any special data cleaning or analysis steps?

How do you plan to deal with this challenge? Are there specialized R packages which can help? If you are unsure, how do you plan to learn how to deal with it?

In a later session we’ll see how to perform small simulation experiments. For a simulation experiment, we need simulated data. Since we know the exact properties of a simulated dataset, we can check that any calculations or models we use in our analysis are correct by first trying them on the simulated dataset.

Create a small dataset (50 or fewer rows) which is similar to your project’s data and if possible contains the challenge you talk about above. Try using some of the functions explained below to make your simulated dataset. Attach your dataset and any code you used to generate your dataset to your solution.

You can create vectors of different types in R:

# Make a vector of 10 0's and a vector of 10 empty strings ""
num_vec <- numeric(10)
char_vec <- character(10)

num_vec

 [1] 0 0 0 0 0 0 0 0 0 0

char_vec

 [1] "" "" "" "" "" "" "" "" "" ""

And then assign values and create a dataframe:

num_vec[1:3] <- 6
num_vec[4:10] <- 8.5
num_vec

 [1] 6.0 6.0 6.0 8.5 8.5 8.5 8.5 8.5 8.5 8.5

char_vec[1:4] <- "Male"
char_vec[5:10] <- "Female"
char_vec

 [1] "Male"   "Male"   "Male"   "Male"   "Female" "Female" "Female" "Female"
 [9] "Female" "Female"

df <- data.frame(char_vec, num_vec)
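By default, data.frame uses the vector names as column names. A minimal sketch of giving the columns more descriptive names and checking the result follows; the names sex and dose are hypothetical and chosen only for illustration.

```r
# Hypothetical column names "sex" and "dose", chosen for illustration
sex <- c(rep("Male", 4), rep("Female", 6))
dose <- c(rep(6, 3), rep(8.5, 7))

# name = value pairs set the column names explicitly
df_named <- data.frame(sex = sex, dose = dose)

str(df_named)   # column types and the first few values
head(df_named)  # first six rows
```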

You can generate random numbers using functions like rnorm:

# Generate 50 random numbers from a normal distribution with mean 1000 and standard deviation 300
norm_nums <- rnorm(50, 1000, 300)
norm_nums

 [1] 1013.3115 1048.9359  836.7754  662.9054  968.8997  633.0454 1225.6682
 [8]  953.9327  952.5162 1086.6503  581.0465 1274.6035  952.2714 1425.3069
[15]  799.9064  498.1916 1024.9842 1341.6011 1453.6266 1322.9491  782.1597
[22] 1280.3406 1021.1595 1022.3608 1456.0364  864.6846  532.5921  829.2248
[29] 1040.8391 1356.2277 1062.8606 1364.3579 1115.9920  949.8190  649.9025
[36]  871.3767 1352.1953 1227.1977 1036.2927 1639.8234  907.2287  863.8866
[43] 1158.2418  449.7143 1513.3078 1053.2559  898.7355 1094.4241  757.6526
[50]  985.8059
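Because rnorm draws new values on every call, your numbers will differ from the ones shown here. If you want your simulated dataset to be reproducible, one option is to call set.seed before generating random numbers. This is a minimal sketch; the seed value 42 is arbitrary.

```r
# Fix the state of the random number generator so results repeat exactly.
# The seed value 42 is arbitrary; any integer works.
set.seed(42)
first_draw <- rnorm(5, mean = 1000, sd = 300)

set.seed(42)
second_draw <- rnorm(5, mean = 1000, sd = 300)

# TRUE: both draws contain the same five numbers
identical(first_draw, second_draw)
```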

Another useful function for generating example data is sample. It selects random items from a vector.

# Select a random item 8 times from norm_nums
sample(norm_nums, 8, replace = TRUE)

[1] 1225.6682 1341.6011  581.0465 1341.6011  952.2714 1425.3069  907.2287
[8] 1086.6503

# Select 6 different random whole numbers between 1 and 100
sample(1:100, 6, replace=FALSE)

[1] 89 56  7 90 44 20

# Randomly replace 25% of norm_nums with NA to simulate missing data
n <- length(norm_nums)
idx <- sample(1:n, round(n * .25), replace = FALSE)
norm_nums[idx] <- NA
norm_nums

 [1] 1013.3115        NA  836.7754  662.9054        NA  633.0454 1225.6682
 [8]  953.9327  952.5162 1086.6503        NA 1274.6035  952.2714 1425.3069
[15]  799.9064  498.1916 1024.9842 1341.6011        NA        NA        NA
[22]        NA 1021.1595        NA 1456.0364  864.6846  532.5921  829.2248
[29] 1040.8391 1356.2277 1062.8606        NA 1115.9920  949.8190  649.9025
[36]  871.3767 1352.1953        NA 1036.2927 1639.8234  907.2287  863.8866
[43] 1158.2418  449.7143 1513.3078 1053.2559  898.7355        NA  757.6526
[50]        NA
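The same functions can be combined to simulate several of the challenges listed earlier at once. The sketch below builds a small dataset with uneven groups and missing values. The group labels, the 80/20 split, and the variable names are assumptions made up for illustration, not properties of any real dataset.

```r
set.seed(1)  # arbitrary seed, for reproducibility
n_rows <- 40

# Uneven groups: each row has an 80% chance of "Control", 20% of "Treatment"
group <- sample(c("Control", "Treatment"), n_rows,
                replace = TRUE, prob = c(0.8, 0.2))

# A numeric measurement with 25% of values set to missing
value <- rnorm(n_rows, mean = 1000, sd = 300)
missing_idx <- sample(seq_len(n_rows), round(n_rows * 0.25))
value[missing_idx] <- NA

sim_df <- data.frame(group, value)

table(sim_df$group)       # group sizes
sum(is.na(sim_df$value))  # number of missing values
```

To attach a dataset like this to your solution, you could save it with write.csv(sim_df, "simulated_data.csv", row.names = FALSE), where the file name is only an example.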

Data analysis exercise 2: Due November 15

What is a data visualization or table you plan to create as a part of your analysis?
Using your simulated dataset, your project data, or the NHANES data from class, create an example of the visualization. Try to get it as close to ‘publication ready’ as possible in terms of legends, axis labels, colors, text size, etc.
Export your figure and attach it to your solution.
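As one possible starting point, here is a sketch of creating and exporting a labeled figure with base R graphics. The data, labels, colors, and file name are all made up for illustration, and the same idea applies if you prefer a plotting package.

```r
set.seed(2)  # arbitrary seed
plot_df <- data.frame(
  group = rep(c("Control", "Treatment"), times = c(30, 20)),
  value = c(rnorm(30, mean = 1000, sd = 300),
            rnorm(20, mean = 1200, sd = 300))
)

# Open a PDF device (6 x 4 inches); the file name is only an example
fig_path <- file.path(tempdir(), "example_figure.pdf")
pdf(fig_path, width = 6, height = 4)

boxplot(value ~ group, data = plot_df,
        col = c("grey80", "steelblue"),
        xlab = "Study group", ylab = "Measurement (units)",
        main = "Measurement by study group",
        cex.lab = 1.2, cex.axis = 1.1)  # slightly larger label and axis text

dev.off()  # close the device so the file is written
```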

Data analysis exercise 3: Due November 29

Name either 2 strategies to deal with missing data, or 2 strategies to deal with duplicate data.
Will the data you’re using for your project contain duplicate and/or missing data?
Write R code using your simulated dataset, your project data, or the NHANES data from class to demonstrate one of the solutions you mentioned.
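To give a sense of what such a demonstration might look like, the sketch below applies two simple missing-data strategies and one duplicate-data strategy to made-up data. These are examples rather than recommendations; mean imputation in particular reduces the variability of the data, so consider whether it suits your project.

```r
set.seed(3)  # arbitrary seed
demo_df <- data.frame(
  id = 1:20,
  value = rnorm(20, mean = 1000, sd = 300)
)
demo_df$value[sample(1:20, 5)] <- NA  # make 5 values missing

# Strategy 1: complete-case analysis, dropping rows with any NA
complete_df <- na.omit(demo_df)
nrow(complete_df)  # 15 rows remain

# Strategy 2: mean imputation, replacing each NA with the observed mean
imputed_df <- demo_df
imputed_df$value[is.na(imputed_df$value)] <- mean(demo_df$value, na.rm = TRUE)
sum(is.na(imputed_df$value))  # 0

# Duplicates: append 3 repeated rows, then keep only the first copy of each
dup_df <- rbind(demo_df, demo_df[1:3, ])
dedup_df <- dup_df[!duplicated(dup_df), ]
nrow(dedup_df)  # 20
```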

Data analysis exercise 4: Due December 6

Choose a relationship between 2 or more variables in your project’s data that you wish to explore.
What test or model would you use to explore that relationship?
Write R code to run your chosen model or test on your simulated dataset, your project data, or the NHANES data from class.
How will you evaluate your model’s success or failure?
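A simulated dataset is useful here because you decide the true relationship yourself and can check whether the model recovers it. The following sketch assumes a linear relationship between two made-up variables, age and value, and fits a linear regression. Your own project may call for a different test or model.

```r
set.seed(4)  # arbitrary seed
n_obs <- 50
age <- round(runif(n_obs, min = 20, max = 70))

# Assumed true relationship: value = 500 + 10 * age + random noise
value <- 500 + 10 * age + rnorm(n_obs, mean = 0, sd = 50)
model_df <- data.frame(age, value)

# Fit a linear regression of value on age
fit <- lm(value ~ age, data = model_df)

coef(fit)               # estimates should be close to 500 and 10
confint(fit)            # 95% confidence intervals for the coefficients
summary(fit)$r.squared  # proportion of variance explained
```

If the estimated slope were far from 10, that would suggest a mistake in either the simulation or the model code.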