Tip: If you are unsure what challenges your project’s data might face, take a look at other papers which use the same or similar data. Do the authors mention any special data cleaning or analysis steps?
How do you plan to deal with this challenge? Are there specialized R packages which can help with this challenge? If you are unsure, how do you plan to learn how to deal with this challenge?
In a later session we’ll see how to perform small simulation experiments. For a simulation experiment, we need simulated data. Since we know the exact properties of a simulated dataset, we can make sure any calculations or models we use in our analysis are correct by first trying them on the simulated dataset.
Create a small dataset (50 or fewer rows) that resembles your project’s data and, if possible, contains the challenge you described above. Try using some of the functions explained below to build your simulated dataset. Attach the dataset and any code you used to generate it to your solution.
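The idea of checking a calculation against simulated data with known properties can be sketched as follows. This is only an illustration: the seed, the variable names, and the tolerance of four standard errors are assumptions chosen for the example, not part of the exercise.

```r
# Minimal sketch: simulate data with a known mean, then check that
# the calculation we plan to use (here, mean()) recovers it.
set.seed(123)  # assumed seed, only so the sketch is repeatable

true_mean <- 1000  # known property of the simulated data
true_sd   <- 300

sim <- rnorm(50, mean = true_mean, sd = true_sd)

est_mean <- mean(sim)
std_err  <- true_sd / sqrt(length(sim))  # standard error of the mean

# The estimate should land within a few standard errors of the truth
within_range <- abs(est_mean - true_mean) < 4 * std_err
within_range
```

If the check fails for a calculation you wrote yourself, the simulated dataset makes it much easier to find out why, because the correct answer is known in advance.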
You can create vectors of different types in R:
# Make a vector of 10 zeros and a vector of 10 empty strings ""
num_vec <- numeric(10)
char_vec <- character(10)
num_vec
[1] 0 0 0 0 0 0 0 0 0 0
char_vec
[1] "" "" "" "" "" "" "" "" "" ""
And then assign values and create a dataframe:
[1] 6.0 6.0 6.0 8.5 8.5 8.5 8.5 8.5 8.5 8.5
[1] "Male" "Male" "Male" "Male" "Female" "Female" "Female" "Female"
[9] "Female" "Female"
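The code that produced the two outputs above is not shown. One way to arrive at the same values is sketched below. The index ranges are inferred from the printed output, and the data frame name `sim_df` and the column names `score` and `sex` are hypothetical placeholders.

```r
# Hypothetical reconstruction: the original assignment code is not shown.
num_vec  <- numeric(10)
char_vec <- character(10)

# Assign values by position
num_vec[1:3]   <- 6
num_vec[4:10]  <- 8.5
char_vec[1:4]  <- "Male"
char_vec[5:10] <- "Female"

# Combine the two vectors into a data frame
# (column names are placeholders for this sketch)
sim_df <- data.frame(score = num_vec, sex = char_vec)
sim_df
```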
You can generate random numbers using functions like rnorm:
# Generate 50 random numbers centered around 1000 with a standard deviation of 300
norm_nums <- rnorm(50, 1000, 300)
norm_nums
[1] 1013.3115 1048.9359 836.7754 662.9054 968.8997 633.0454 1225.6682
[8] 953.9327 952.5162 1086.6503 581.0465 1274.6035 952.2714 1425.3069
[15] 799.9064 498.1916 1024.9842 1341.6011 1453.6266 1322.9491 782.1597
[22] 1280.3406 1021.1595 1022.3608 1456.0364 864.6846 532.5921 829.2248
[29] 1040.8391 1356.2277 1062.8606 1364.3579 1115.9920 949.8190 649.9025
[36] 871.3767 1352.1953 1227.1977 1036.2927 1639.8234 907.2287 863.8866
[43] 1158.2418 449.7143 1513.3078 1053.2559 898.7355 1094.4241 757.6526
[50] 985.8059
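Because rnorm draws new random numbers on every call, running the code above will not reproduce these exact values. A minimal sketch of how to make a simulated dataset repeatable is shown below. The seed value 42 is an arbitrary assumption, and the arguments are written by name (n, mean, sd) for readability.

```r
# Sketch: fixing the seed makes the "random" draws repeatable.
set.seed(42)  # arbitrary seed chosen for this example
first_draw <- rnorm(n = 50, mean = 1000, sd = 300)

set.seed(42)  # same seed again
second_draw <- rnorm(n = 50, mean = 1000, sd = 300)

# Both vectors contain exactly the same 50 numbers
identical(first_draw, second_draw)
```

Setting a seed at the top of the script that generates your simulated dataset means anyone rerunning the script gets the same data.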
Another useful function for generating example data is sample. It selects random items from a vector.
[1] 1225.6682 1341.6011 581.0465 1341.6011 952.2714 1425.3069 907.2287
[8] 1086.6503
[1] 89 56 7 90 44 20
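The calls behind the two outputs above are not shown. The sketch below gives output of the same shape, under two assumptions read off the output: the first call draws 8 values from norm_nums with replacement (one value appears twice), and the second draws 6 integers from a range taken here to be 1:100. The seed and variable names are also placeholders.

```r
# Sketch of two typical sample() calls (arguments are assumptions).
set.seed(1)  # assumed seed, only so the sketch is repeatable
norm_nums <- rnorm(50, mean = 1000, sd = 300)

# Draw 8 values from norm_nums; replace = TRUE allows repeats
picked_nums <- sample(norm_nums, 8, replace = TRUE)
picked_nums

# Draw 6 distinct integers between 1 and 100
# (replace = FALSE is the default)
picked_ints <- sample(1:100, 6)
picked_ints
```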
# Randomly replace 25% of norm_nums with NA to simulate missing data
n <- length(norm_nums)
idx <- sample(1:n, round(n * .25), replace = FALSE)
norm_nums[idx] <- NA
norm_nums
[1] 1013.3115 NA 836.7754 662.9054 NA 633.0454 1225.6682
[8] 953.9327 952.5162 1086.6503 NA 1274.6035 952.2714 1425.3069
[15] 799.9064 498.1916 1024.9842 1341.6011 NA NA NA
[22] NA 1021.1595 NA 1456.0364 864.6846 532.5921 829.2248
[29] 1040.8391 1356.2277 1062.8606 NA 1115.9920 949.8190 649.9025
[36] 871.3767 1352.1953 NA 1036.2927 1639.8234 907.2287 863.8866
[43] 1158.2418 449.7143 1513.3078 1053.2559 898.7355 NA 757.6526
[50] NA
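Once missing values are in the data, it helps to check how many there are and how they affect calculations. The sketch below repeats the replacement step on freshly simulated data (with an assumed seed) so that it runs on its own. Note that round(50 * .25) is round(12.5), which R rounds to the even value 12, matching the 12 NA entries in the output above.

```r
# Sketch: count missing values and compute a mean that ignores them.
set.seed(2024)  # assumed seed, only so the sketch is repeatable
norm_nums <- rnorm(50, mean = 1000, sd = 300)

# Same replacement step as above
n   <- length(norm_nums)
idx <- sample(1:n, round(n * .25), replace = FALSE)
norm_nums[idx] <- NA

# How many values are missing?
n_missing <- sum(is.na(norm_nums))
n_missing

# mean() returns NA if any value is missing ...
mean_with_na <- mean(norm_nums)

# ... unless the missing values are dropped explicitly
mean_observed <- mean(norm_nums, na.rm = TRUE)
mean_observed
```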