A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, such as numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:
weight_g <- c(50, 60, 65, 82)
weight_g
[1] 50 60 65 82
A vector can also contain characters:
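As a minimal sketch, a character vector could be created as follows. The object name molecules and the order of its elements are assumptions, chosen to be consistent with the outputs shown later in this lesson.

```r
# Assumed example vector: the name and element order are illustrative
molecules <- c("dna", "rna", "peptide", "protein")
molecules
```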
The quotes around “dna”, “rna”, etc. are essential here. Without the quotes R will assume there are objects called dna, rna and protein. As these objects don’t exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
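For illustration, length() could be applied to the two vectors discussed so far. The molecules vector is an assumed example, not necessarily the one used in the original lesson.

```r
weight_g <- c(50, 60, 65, 82)
molecules <- c("dna", "rna", "peptide", "protein")  # assumed example vector

length(weight_g)   # 4
length(molecules)  # 4
```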
An important feature of a vector is that all of the elements are the same type of data. The function class() indicates the class (the type of element) of an object:
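A short sketch of class() on a numeric and a character vector, again assuming an illustrative molecules vector:

```r
weight_g <- c(50, 60, 65, 82)
molecules <- c("dna", "rna", "peptide", "protein")  # assumed example vector

class(weight_g)   # "numeric"
class(molecules)  # "character"
```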
The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
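On small vectors such as these, str() prints the type, the index range and the values on one line. The vectors below are the same illustrative ones as before (molecules is an assumption).

```r
weight_g <- c(50, 60, 65, 82)
molecules <- c("dna", "rna", "peptide", "protein")  # assumed example vector

str(weight_g)   # num [1:4] 50 60 65 82
str(molecules)  # chr [1:4] "dna" "rna" "peptide" "protein"
```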
You can use the c() function to add other elements to your vector:
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
[1] 30 50 60 65 82 90
In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.
We can do this over and over again to grow a vector, or assemble a dataset. As we program, this can be useful for adding results as we collect or calculate them.
An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
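As a quick illustration, typeof() can be called on one small value of each of the six atomic types. The values themselves are arbitrary examples.

```r
typeof(c(50, 60, 65, 82))  # "double"
typeof(c("dna", "rna"))    # "character"
typeof(c(TRUE, FALSE))     # "logical"
typeof(2L)                 # "integer"
typeof(1 + 4i)             # "complex"
typeof(as.raw(255))        # "raw"
```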
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
R implicitly converts them all to the same type.
Why do you think it happens?
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a common denominator that doesn’t lose any information.
In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
logical → numeric → character ← logical
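The hierarchy can be checked with a few mixed vectors. This is a small sketch; the object names are illustrative and not taken from the text above.

```r
num_char     <- c(1, 2, 3, "a")   # numeric + character -> character
num_logical  <- c(1, 2, 3, TRUE)  # numeric + logical   -> numeric (TRUE becomes 1)
char_logical <- c("a", "b", TRUE) # character + logical -> character

class(num_char)      # "character"
class(num_logical)   # "numeric"
class(char_logical)  # "character"

num_logical   # 1 2 3 1
char_logical  # "a" "b" "TRUE"
```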
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
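A minimal sketch of such indexing, assuming a character vector named molecules whose element order is chosen to be consistent with the outputs shown in this part of the lesson:

```r
# Assumed vector; name and element order are illustrative
molecules <- c("dna", "rna", "peptide", "protein")

molecules[2]        # the second element
molecules[c(3, 2)]  # the third element, then the second
```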
[1] "rna"
[1] "peptide" "rna"
We can also repeat the indices to create an object with more elements than the original one:
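One set of indices that would produce a six-element result from a four-element vector is sketched here. Both the molecules vector and the name more_molecules are assumptions made for illustration.

```r
# Assumed vector; name and element order are illustrative
molecules <- c("dna", "rna", "peptide", "protein")

# Indices may repeat, so the result can be longer than the original
more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules
```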
[1] "dna" "rna" "peptide" "rna" "dna" "protein"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:
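For example, with an assumed four-element molecules vector, negative indices drop the listed positions and keep everything else:

```r
# Assumed vector; name and element order are illustrative
molecules <- c("dna", "rna", "peptide", "protein")

molecules[-1]        # all but the first:  "rna" "peptide" "protein"
molecules[-c(1, 3)]  # all but the first and third: "rna" "protein"
```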
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:
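A small sketch of logical subsetting. Here weight_g is redefined with five assumed values, chosen to be consistent with the outputs shown in the rest of this section.

```r
weight_g <- c(21, 34, 39, 54, 55)  # assumed values

# One TRUE/FALSE per element: keep positions 1, 3 and 4
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]  # 21 39 54
```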
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
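This could look as follows, assuming weight_g holds the five illustrative values below. The first expression returns the logical vector and the second uses it to subset.

```r
weight_g <- c(21, 34, 39, 54, 55)  # assumed values

weight_g > 50            # one TRUE/FALSE per element
weight_g[weight_g > 50]  # keep only the elements where the test is TRUE
```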
[1] FALSE FALSE FALSE TRUE TRUE
[1] 54 55
You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
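A sketch of both operators, assuming the same illustrative weight_g values. The second test can never be satisfied, because no value is both at least 30 and equal to 21, so it returns an empty numeric vector.

```r
weight_g <- c(21, 34, 39, 54, 55)  # assumed values

weight_g[weight_g < 30 | weight_g > 50]    # OR: below 30 or above 50
weight_g[weight_g >= 30 & weight_g == 21]  # AND: no element satisfies both
```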
[1] 21 54 55
numeric(0)
Here, < stands for “less than”, > for “greater than”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == is a test for numerical equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).
A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test whether each element of a vector is found in a search vector:
molecules <- c("dna", "rna", "protein", "peptide")
molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna
[1] "dna" "rna"
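The same kind of search with %in% might look like this. The contents of the search vector are an assumption made for illustration; any values not present in molecules are simply not matched.

```r
molecules <- c("dna", "rna", "protein", "peptide")
# Hypothetical search vector; some entries do not occur in molecules
search_terms <- c("rna", "dna", "heptahelical", "mouse", "peptide")

molecules %in% search_terms             # one TRUE/FALSE per element of molecules
molecules[molecules %in% search_terms]  # keep only the matched elements
```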
[1] TRUE TRUE FALSE TRUE
[1] "dna" "rna" "peptide"
Can you figure out why "four" > "five" returns TRUE?
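One way to explore this is to sort the two strings. When > or < is applied to strings, R compares their alphabetical order; both words start with "f", and "o" comes after "i".

```r
"four" > "five"          # TRUE
sort(c("four", "five"))  # "five" "four"
```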
It is possible to name each element of a vector. The code chunk below shows an initial vector without any names, how names are set, and retrieved.
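A sketch of these three steps. The vector x and its values are assumptions; only the names match the output shown here.

```r
x <- c(1, 5, 3, 5, 10)  # assumed values

names(x)                                # no names have been set yet
names(x) <- c("A", "B", "C", "D", "E")  # set one name per element
names(x)                                # retrieve the names
```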
NULL
[1] "A" "B" "C" "D" "E"
When a vector has names, it is possible to access elements by their name, in addition to their index.
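For instance, with an assumed named vector x, selecting by position and selecting by name return the same elements:

```r
x <- c(1, 5, 3, 5, 10)  # assumed values
names(x) <- c("A", "B", "C", "D", "E")

x[c(1, 3)]      # by index
x[c("A", "C")]  # by name, same two elements
```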
The materials in this lesson have been adapted from work created by the HBC and Data Carpentry, as well as materials created by Laurent Gatto, Charlotte Soneson, Jenny Drnevich, Robert Castelo, and Kevin Rue-Albert. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.