Vectors, Types, and Conditional Statements

Vectors and data types

A vector is the most common and basic data structure in R, and is pretty much the workhorse of the language. A vector is composed of a series of values, such as numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:

weight_g <- c(50, 60, 65, 82)
weight_g

[1] 50 60 65 82

A vector can also contain characters:

molecules <- c("dna", "rna", "protein")
molecules

[1] "dna"     "rna"     "protein"

The quotes around “dna”, “rna”, etc. are essential here. Without the quotes R will assume there are objects called dna, rna and protein. As these objects don’t exist in R’s memory, there will be an error message.
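
To see the difference the quotes make, the sketch below wraps the unquoted version in tryCatch() so the error is captured as text instead of stopping the script. The names err_msg and quoted are illustrative and not part of the lesson, and the sketch assumes that no object called dna exists in your session.

```r
# Assumes no object named `dna` exists in the current session.
# tryCatch() captures the error so the script keeps running.
err_msg <- tryCatch(
  c(dna, "rna"),                          # dna without quotes: R looks for an object
  error = function(e) conditionMessage(e) # return the error text instead of stopping
)
err_msg                                   # the message R produced

quoted <- c("dna", "rna")                 # with quotes: a character vector
quoted
```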

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weight_g)

[1] 4

length(molecules)

[1] 3

An important feature of a vector is that all of its elements are the same type of data. The function class() indicates the class (the type of element) of an object:

class(weight_g)

[1] "numeric"

class(molecules)

[1] "character"

The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

str(weight_g)

 num [1:4] 50 60 65 82

str(molecules)

 chr [1:3] "dna" "rna" "protein"

You can use the c() function to add other elements to your vector:

weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g

[1] 30 50 60 65 82 90

In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.

We can do this over and over again to grow a vector or assemble a dataset. As we program, this is a handy way to accumulate results as we collect or calculate them.
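
As a minimal sketch of collecting results this way, the code below appends one calculated value at a time. It uses a for loop, which this section does not otherwise cover, and the name squares is illustrative.

```r
# `squares` is an illustrative name; start from an empty numeric vector.
squares <- numeric(0)
for (i in 1:4) {
  squares <- c(squares, i^2)  # append each new result to the end
}
squares                       # 1 4 9 16
length(squares)               # 4
```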

An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:

"logical" for TRUE and FALSE (the boolean data type)
"integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
"complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
"raw" for bitstreams that we won’t discuss further

You can check the type of a vector by passing it to the typeof() function.
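
The short sketch below calls typeof() on one value of each atomic type listed above. The name example_dbl is illustrative. Note that class() reports "numeric" for a vector whose type is "double".

```r
example_dbl <- c(50, 60, 65, 82)  # illustrative name; plain numbers are doubles

typeof(example_dbl)   # "double"
typeof(2L)            # "integer" (the L suffix requests an integer)
typeof(TRUE)          # "logical"
typeof("dna")         # "character"
typeof(1 + 4i)        # "complex"
typeof(as.raw(255))   # "raw"

class(example_dbl)    # "numeric", even though typeof() says "double"
```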

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
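
For orientation only, here is a rough sketch that creates one small object of each structure named above. All object names and values are made up for illustration.

```r
# All names and values below are illustrative.
a_list   <- list(1, "a", TRUE)               # elements may differ in type
a_matrix <- matrix(1:6, nrow = 2)            # 2 rows x 3 columns, a single type
a_df     <- data.frame(id  = 1:3,
                       mol = c("dna", "rna", "protein"))  # columns may differ in type
a_factor <- factor(c("low", "high", "low"))  # categorical values with levels
an_array <- array(1:24, dim = c(2, 3, 4))    # like a matrix, with more dimensions

class(a_list)
dim(a_matrix)
class(a_df)
levels(a_factor)
dim(an_array)
```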

Challenge:

We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?

Solution

R implicitly converts them all to the same type.

Challenge:

What will happen in each of these examples? (hint: use class() to check the data type of your objects and type in their names to see what happens):

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE, FALSE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

Solution

class(num_char)

[1] "character"

num_char

[1] "1" "2" "3" "a"

class(num_logical)

[1] "numeric"

num_logical

[1] 1 2 3 1 0

class(char_logical)

[1] "character"

char_logical

[1] "a"    "b"    "c"    "TRUE"

class(tricky)

[1] "character"

tricky

[1] "1" "2" "3" "4"

Challenge:

Why do you think it happens?

Solution

Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a common denominator that doesn’t lose any information.

Challenge:

How many values in combined_logical are "TRUE" (as a character) in the following example:

num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
combined_logical <- c(num_logical, char_logical)

Solution

Only one. R keeps no memory of past data types, and coercion happens as soon as each vector is created. Therefore, the TRUE in num_logical is converted into a 1 when num_logical is created, and that 1 is then converted into "1" in combined_logical.

combined_logical

[1] "1"    "2"    "3"    "1"    "a"    "b"    "c"    "TRUE"

Challenge:

In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?

Solution

logical → numeric → character ← logical
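
One way to check this hierarchy is to add one "higher" type at a time and look at the class of the result. The sketch below also includes an integer (2L), which sits between logical and double.

```r
class(c(TRUE, FALSE))          # "logical"
class(c(TRUE, 2L))             # "integer": the logical becomes 1L
class(c(TRUE, 2L, 3.5))        # "numeric": everything becomes a double
class(c(TRUE, 2L, 3.5, "a"))   # "character": everything becomes a string

c(TRUE, 2L, 3.5, "a")          # "TRUE" "2" "3.5" "a"
```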

Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

molecules <- c("dna", "rna", "peptide", "protein")
molecules[2]

[1] "rna"

molecules[c(3, 2)]

[1] "peptide" "rna"

We can also repeat the indices to create an object with more elements than the original one:

more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules

[1] "dna"     "rna"     "peptide" "rna"     "dna"     "protein"

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
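
A brief sketch of what 1-based indexing means in practice, using the same molecules vector as above. Note that R does not raise an error for index 0 or for an index past the end of the vector.

```r
molecules <- c("dna", "rna", "peptide", "protein")

molecules[1]                   # "dna": the first element
molecules[length(molecules)]   # "protein": the last element
molecules[0]                   # character(0): index 0 selects nothing
molecules[5]                   # NA: past the end gives a missing value
```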

Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:

molecules ## all molecules

[1] "dna"     "rna"     "peptide" "protein"

molecules[-1] ## all but the first one

[1] "rna"     "peptide" "protein"

molecules[-c(1, 3)] ## all but 1st/3rd ones

[1] "rna"     "protein"

molecules[c(-1, -3)] ## all but 1st/3rd ones

[1] "rna"     "protein"
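
Two further points about negative indices, sketched below: they combine with length() to drop the last element, and they cannot be mixed with positive indices in the same call. The names all_but_last and mixed are illustrative, and tryCatch() is used only so the error does not stop the script.

```r
molecules <- c("dna", "rna", "peptide", "protein")

all_but_last <- molecules[-length(molecules)]  # drop the last element
all_but_last                                   # "dna" "rna" "peptide"

# Mixing negative and positive indices is an error in R.
mixed <- tryCatch(
  molecules[c(-1, 2)],
  error = function(e) "error"
)
mixed                                          # "error"
```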

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

[1] 21 39 54

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:

## will return logicals with TRUE for the indices that meet
## the condition
weight_g > 50

[1] FALSE FALSE FALSE  TRUE  TRUE

## so we can use this to select only the values above 50
weight_g[weight_g > 50]

[1] 54 55
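
Because a logical test returns an ordinary logical vector, it can be stored and reused. The sketch below counts the matches with sum(), which treats TRUE as 1, and finds their positions with which(). The name heavy is illustrative.

```r
weight_g <- c(21, 34, 39, 54, 55)
heavy <- weight_g > 50     # illustrative name for the stored test result

sum(heavy)                 # 2: TRUE counts as 1, FALSE as 0
which(heavy)               # 4 5: positions of the TRUE values
weight_g[which(heavy)]     # 54 55: same result as weight_g[heavy]
```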

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

weight_g[weight_g < 30 | weight_g > 50]

[1] 21 54 55

weight_g[weight_g >= 30 & weight_g == 21]

numeric(0)

Here, < stands for “less than”, > for “greater than”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == tests whether the left- and right-hand sides are equal, and should not be confused with the single = sign, which performs variable assignment (similar to <-).
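
It can help to look at the intermediate logical vectors before using them for subsetting. The sketch below does so, and also shows an & test that selects a range of values. It uses <= (“less than or equal to”), which does not appear in the examples above.

```r
weight_g <- c(21, 34, 39, 54, 55)

weight_g < 30                       # TRUE FALSE FALSE FALSE FALSE
weight_g > 50                       # FALSE FALSE FALSE TRUE TRUE
weight_g < 30 | weight_g > 50       # TRUE where either test is TRUE

in_range <- weight_g >= 30 & weight_g <= 50  # TRUE where both tests are TRUE
weight_g[in_range]                  # 34 39
```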

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The %in% operator tests, for each element of the vector on its left, whether that element is found in the search vector on its right:

molecules <- c("dna", "rna", "protein", "peptide")
molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna

[1] "dna" "rna"

molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")

[1]  TRUE  TRUE FALSE  TRUE

molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]

[1] "dna"     "rna"     "peptide"
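
A related pitfall is using == with a search vector in place of %in%. With ==, R compares the two vectors element by element and recycles the shorter one, so the result depends on position rather than membership. A small sketch, with targets as an illustrative name:

```r
molecules <- c("dna", "rna", "protein", "peptide")
targets <- c("rna", "dna")   # illustrative search vector

# == pairs the elements up: "dna" vs "rna", "rna" vs "dna",
# "protein" vs "rna", "peptide" vs "dna"
molecules == targets         # FALSE FALSE FALSE FALSE

# %in% asks whether each molecule appears anywhere in targets
molecules %in% targets       # TRUE TRUE FALSE FALSE
```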

Challenge:

Can you figure out why "four" > "five" returns TRUE?

Solution

"four" > "five"

[1] TRUE

When using > or < on strings, R compares their alphabetical order. Here "four" comes after "five", and therefore is greater than it.
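
The same rule applies to numbers stored as strings, which can be surprising. In the sketch below, "10" sorts before "9" because the comparison starts with the first character. The exact ordering of strings can depend on your locale, so treat this as an illustration only.

```r
sort(c("four", "five", "six"))       # "five" "four" "six"

"10" < "9"                           # TRUE: compared as text, "1" comes before "9"
as.numeric("10") < as.numeric("9")   # FALSE: compared as numbers
```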

Names

It is possible to name each element of a vector. The code below shows a vector that starts without names, then how names are set and retrieved.

x <- c(1, 5, 3, 5, 10)
names(x) ## no names

NULL

names(x) <- c("A", "B", "C", "D", "E")
names(x) ## now we have names

[1] "A" "B" "C" "D" "E"

When a vector has names, it is possible to access elements by their name, in addition to their index.

x[c(1, 3)]

A C 
1 3

x[c("A", "C")]

A C 
1 3
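
Names can also be given when the vector is created, and they work for replacing values as well as reading them. A minimal sketch using the same values as above:

```r
x <- c(A = 1, B = 5, C = 3, D = 5, E = 10)  # names set at creation

names(x)[x == 5]   # "B" "D": names of the elements equal to 5

x["B"] <- 50       # replace a value by name
x["B"]             # now 50
names(x)[x == 5]   # "D"
```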

The materials in this lesson have been adapted from work created by the HBC and Data Carpentry, as well as materials created by Laurent Gatto, Charlotte Soneson, Jenny Drnevich, Robert Castelo, and Kevin Rue-Albert. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.