The R programming language is a powerful tool for evolutionary biology research, offering a vast array of statistical and visualization tools that are essential for data analysis. With its extensive library of packages, R allows researchers to quickly and easily perform complex statistical analyses, including phylogenetic analyses, genome-wide association studies, and population genetics. R also provides a range of plotting and visualization functions that enable researchers to explore and present their data in a clear and concise manner. Additionally, R can be used in conjunction with other software tools, such as BEAST and IQ-TREE, to facilitate more complex evolutionary analyses. Due to its versatility and user-friendly interface, R has become an increasingly popular choice for evolutionary biologists looking to conduct high-quality research.
This lesson was designed to be a gentle introduction to R for evolutionary genomics. Thus, we will only cover the basics of the language. An alternative name for this lesson could be “all you need to know about R to start your evolutionary analyses”. At the end of this lesson, you will be able to:
The simplest use of R is to perform mathematical operations, which can be simply typed in the console:
2 + 2
[1] 4
4 * 5
[1] 20
20 / 10
[1] 2
5 - 3
[1] 2
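Besides the four operators shown above, base R also supports exponentiation (^), the remainder of a division (%%), and integer division (%/%). Parentheses control the order in which operations are evaluated. A quick sketch you can type in the console:

```r
# Exponentiation: 2 to the power of 10
2^10        # 1024

# Remainder and integer division of 17 by 5
17 %% 5     # 2
17 %/% 5    # 3

# Multiplication happens before addition unless parentheses say otherwise
2 + 3 * 4   # 14
(2 + 3) * 4 # 20
```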
However, simple mathematical operations are not very useful. Typically, you would want to assign values to R objects. R objects can be created by using object_name <- value. The <- symbol is called the “assignment operator”, and it assigns values to objects. Example:
genome_size <- 2000000
Note that when you assign a value to an object, nothing happens on the screen. In the example above, we created an object named genome_size that contains the value 2000000, but the value is not printed to screen. To show the contents of an object, you must type its name and execute it. For example:
genome_size
[1] 2e+06
Alternatively, we can assign values to objects AND print them at the same time by surrounding the assignment code with parentheses. For example:
(genome_size <- 2000000)
[1] 2e+06
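The 2e+06 shown above is scientific notation for 2000000; when printing, R uses it whenever it is narrower than the full number. If you prefer to see the full number, one option is the base R function format(), which returns the value as text. A small sketch:

```r
genome_size <- 2000000

# Show the value without scientific notation (returns a character string)
format(genome_size, scientific = FALSE)

# Optionally add a thousands separator for readability
format(genome_size, big.mark = ",", scientific = FALSE)
```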
Although you can give your objects whatever name you want, some general best practices include using a consistent naming style, such as snake case (e.g., phylo_tree) and camel case (e.g., phyloTree). Note that object names cannot start with numbers.
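To check whether a candidate name is syntactically valid, one option is the base R function make.names(), which returns its input unchanged when the name is already valid and a repaired version otherwise. A short sketch (the names used here are only illustrative):

```r
# Valid names are returned unchanged
make.names("phylo_tree")
make.names("phyloTree")

# A name starting with a number is not valid, so R prepends an "X"
make.names("2nd_tree")

# TRUE only when the candidate name needs no repair
candidate <- "2nd_tree"
make.names(candidate) == candidate
```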
Now that we have an object genome_size with a value, we can use the object to perform mathematical operations. For example, let’s suppose we want to see the genome size in thousands of base pairs (i.e., kbp, which stands for kilobase pairs):
genome_size / 1000
[1] 2000
We can also change the value of an object by assigning a new value to it. For instance, let’s update the genome_size object with the size in kbp:
genome_size # original object
[1] 2e+06
genome_size <- genome_size / 1000 # modifying the object
genome_size
[1] 2000
As you can see, the object genome_size now contains the value 2000.
Functions are scripts that automate a particular task, and they typically have verbs in their names. They are made available to users as part of packages, and there are several R packages with specialized functions for each field. When you download and install R, it already comes with some packages installed, such as base and utils, but you can also install other packages depending on your needs.
A function takes one or many arguments as input and returns something as output. To execute a function in R, you need to write the function name followed by the arguments inside parentheses. For example, let’s execute the function sqrt(), which takes a numeric value as input and returns its square root:
sqrt(100)
[1] 10
You can also pass objects as input to functions. For example, let’s pass our object genome_size to the sqrt() function:
sqrt(genome_size)
[1] 44.72136
To see all the possible arguments a function takes, use the function args() with the function name as input. For example, let’s see all possible arguments for the round() function (which rounds numbers):
args(round)
function (x, digits = 0)
NULL
We can see that the round() function can take two arguments as input: x, which is a numeric value, and digits, which indicates how many digits should be used when rounding values. Let’s try to execute the round() function with different options to digits.
pi <- 3.14159
round(pi)
[1] 3
round(pi, digits = 2)
[1] 3.14
round(pi, digits = 3)
[1] 3.142
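Arguments can be matched by name or by position, and round() rounds to the nearest value rather than always up. For rounding in a fixed direction, base R offers ceiling() and floor(). A brief sketch, using a hypothetical object named gc_content:

```r
# Hypothetical value used only for illustration
gc_content <- 36.0567

# Named and positional arguments give the same result
round(gc_content, digits = 1)
round(gc_content, 1)

# Always round up or always round down to a whole number
ceiling(gc_content)
floor(gc_content)
```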
Finally, if you want to learn more about what a function does and how to use it, you can type the function name preceded by a question mark. For example:
?round
This will open up a help page with detailed information on what the function does, its arguments, and examples.
The genome of the plant Arabidopsis thaliana contains 33768 genes, of which 27655 are protein-coding and 6113 encode non-coding RNAs. Calculate the percentage of coding and non-coding genes in this genome using the following steps:
1. Create an object named ath_genes that stores the numeric value 33768.
2. Create objects named n_coding and n_noncoding that store the numeric values 27655 and 6113, respectively.
3. Create objects named perc_coding and perc_noncoding by dividing n_coding and n_noncoding by ath_genes.
Now, we will explore two data structures in R that we will use all the time throughout the lessons (and in our research routine): vectors and data frames.
Vectors are the most common data structures in R, and they are used to store multiple values. There are 6 kinds of values that can be stored in a vector, but the ones we typically use are:
- numeric: numbers.
- character: text strings.
- logical: either TRUE or FALSE.
The other types are integer, complex, and raw, but they are not important here.
To create a vector, you will use the c() function to combine values. For example:
# Creating a numeric vector
genome_sizes <- c(33768, 52872, 39756)
genome_sizes
[1] 33768 52872 39756
# Creating a character vector
species <- c("Arabidopsis", "soybean", "maize")
species
[1] "Arabidopsis" "soybean"     "maize"
You can inspect vectors with the functions:
- class(): shows the class of a vector.
- length(): shows the number of elements in a vector.
- str(): summarizes the structure of a vector.
class(genome_sizes)
[1] "numeric"
length(genome_sizes)
[1] 3
str(genome_sizes)
num [1:3] 33768 52872 39756
You can also combine two vectors:
species1 <- c("soybean", "maize")
species2 <- c("cotton", "bean")
combined_species <- c(species1, species2)
combined_species
[1] "soybean" "maize" "cotton" "bean"
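Most operations in R are vectorized, meaning that they are applied to every element of a vector at once. As a small sketch with made-up values (the object names and numbers are only illustrative):

```r
# Hypothetical genome sizes in base pairs
sizes_bp <- c(2000000, 4500000, 120000)

# Dividing a vector by a single number divides each element
sizes_kbp <- sizes_bp / 1000
sizes_kbp

# Functions such as sum() and mean() summarize all elements
sum(sizes_bp)
mean(sizes_kbp)
```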
NOTE: Vectors can only store values of the same type (e.g., character or numeric). If you try to create a vector that contains values of different types, R converts them to all be the same type. This conversion of one class to another is called coercion. For example:
# Creating a vector with numeric and character values
c_vector <- c("a", "b", 1, "d")
c_vector
[1] "a" "b" "1" "d"
In the example above, R converted the numeric value 1 to a character.
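Coercion follows a fixed order: logical values are converted to numeric, and both are converted to character when text is present. You can also convert explicitly with functions such as as.numeric(). A short sketch (object names are only illustrative):

```r
# Logical and numeric: TRUE becomes 1, FALSE becomes 0
mixed_num <- c(10, TRUE, FALSE)
class(mixed_num)

# Anything combined with text becomes character
mixed_chr <- c("a", 1, TRUE)
class(mixed_chr)

# Explicit conversion from text back to numbers
as.numeric(c("1", "2", "3"))
```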
Data frame is the name R uses for tables. To create a new data frame, you will use the data.frame() function. Each column of a data frame is a vector, so you can create a data frame by passing one vector per column. For example, let’s recreate the genome_sizes and species vectors we created in the previous section and store them in columns genome_size and species of a data frame:
# Create vectors again
genome_sizes <- c(33768, 52872, 39756)
species <- c("Arabidopsis", "soybean", "maize")
# Create a data frame with columns `genome_size` and `species`
genome_size_df <- data.frame(
    genome_size = genome_sizes,
    species = species
)
genome_size_df
  genome_size     species
1       33768 Arabidopsis
2       52872     soybean
3       39756       maize
We can also create the vectors inside the data.frame() function itself:
genome_size_df <- data.frame(
    genome_size = c(33768, 52872, 39756),
    species = c("Arabidopsis", "soybean", "maize")
)
genome_size_df
  genome_size     species
1       33768 Arabidopsis
2       52872     soybean
3       39756       maize
To add a column to an existing data frame, you have to add $ followed by the new column name after the object name, and then assign a vector to it. For example, let’s add a column named is_model to the genome_size_df data frame that indicates whether or not a species is a model species:
# Add `is_model` column
genome_size_df$is_model <- c(TRUE, FALSE, FALSE)
genome_size_df
  genome_size     species is_model
1       33768 Arabidopsis     TRUE
2       52872     soybean    FALSE
3       39756       maize    FALSE
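A new column can also be computed from an existing one, because each column is a vector. As a sketch, the code below rebuilds the data frame and adds a hypothetical column named size_thousands:

```r
genome_size_df <- data.frame(
    genome_size = c(33768, 52872, 39756),
    species = c("Arabidopsis", "soybean", "maize")
)

# Access a column with `$` and use it to create another column
genome_size_df$size_thousands <- genome_size_df$genome_size / 1000
genome_size_df
```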
Finally, to inspect a data frame, you can use the following functions:
- dim(): shows the dimensions of the data frame (i.e., number of rows and columns, respectively).
- nrow(): shows the number of rows in a data frame.
- ncol(): shows the number of columns in a data frame.
- head(): shows the first 6 rows of a data frame.
- tail(): shows the last 6 rows of a data frame.
- names(): shows the column names.
- rownames(): shows the row names.
- str(): summarizes the structure of a data frame.
- summary(): shows summary statistics for each column.
Create a data frame named species_families with 2 columns named species and family representing such information.
When you install R, it already comes with some example data sets. One of them is airquality, a data frame containing New York air quality measurements from May to September 1973. Inspect this data frame and answer the following questions:
Here, you will learn how to extract specific elements of vectors and data frames, which is called subsetting.
To subset vectors, you need to pass the index of the element you want to extract inside square brackets ([]). If you want to extract multiple elements, you have to pass a vector of indices inside the square brackets. For example:
plants <- c("rice", "maize", "duckweed")
# Extract 1st element
plants[1]
[1] "rice"
# Extract 1st and 2nd element
plants[c(1, 2)]
[1] "rice" "maize"
# Extract 3rd and 2nd element (in this order)
plants[c(3, 2)]
[1] "duckweed" "maize"
You can also remove a given element by adding a minus (-) symbol before the index:
# Get all elements, except the 1st
plants[-1]
[1] "maize" "duckweed"
# Get all elements, except the 1st and 3rd
plants[-c(1, 3)]
[1] "maize"
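Indices can also be computed rather than typed. For instance, length() gives the position of the last element, and the colon operator builds a sequence of consecutive indices. A minimal sketch:

```r
plants <- c("rice", "maize", "duckweed")

# Extract the last element, whatever the vector length
plants[length(plants)]

# Extract elements 2 to 3 (`2:3` is the same as `c(2, 3)`)
plants[2:3]
```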
Another very common way to subset vectors is by using logical vectors. When using logical vectors, TRUE and FALSE will indicate whether to extract or not extract the element. For example, let’s use logical vectors to subset the plants vector we created above:
plants
[1] "rice" "maize" "duckweed"
# Extract 1st element, do not extract 2nd and 3rd elements
plants[c(TRUE, FALSE, FALSE)]
[1] "rice"
# Extract 1st and 3rd elements, do not extract 2nd
plants[c(TRUE, FALSE, TRUE)]
[1] "rice" "duckweed"
Now, you might be thinking: using logical vectors seems complicated. Why would someone do this instead of using indices? The answer is conditional subsetting. In R, you can use logical expressions that return TRUE or FALSE to extract elements. For example, let’s create a vector with the number of genes of fictional species and check if they have more than 20,000 genes:
ngenes <- c(52000, 35000, 18000, 17000, 22000, 11000, 13000)
# Logical expression: is the element > 20000?
ngenes > 20000
[1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE
You can see that the logical expression returns a vector of TRUEs and FALSEs. Since TRUE means extract and FALSE means do not extract when subsetting, we can use logical expressions to subset vectors as follows:
# Extract elements with >20000 genes
ngenes[ngenes > 20000]
[1] 52000 35000 22000
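Logical vectors are also handy for counting and locating elements: sum() counts the TRUE values (each TRUE counts as 1), and which() returns their positions. A short sketch using the same values:

```r
ngenes <- c(52000, 35000, 18000, 17000, 22000, 11000, 13000)

# How many elements have more than 20000 genes?
sum(ngenes > 20000)

# At which positions are they?
which(ngenes > 20000)
```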
You can combine multiple tests in logical expressions using & (which means AND, both conditions are true) and | (which means OR, at least one of the conditions is true).
# Extract elements with number of genes between 20000 and 40000
ngenes[ngenes > 20000 & ngenes < 40000]
[1] 35000 22000
# Extract elements with 13000 or 11000 genes
ngenes[ngenes == 13000 | ngenes == 11000]
[1] 11000 13000
You can also subset a vector based on the presence of some pre-defined elements using the %in% operator. For example:
animals <- c("dog", "cat", "rat", "pig", "horse")
animals[animals %in% c("cat", "dog")]
[1] "dog" "cat"
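To keep everything except the pre-defined elements, the test can be negated with the ! operator (logical NOT). A minimal sketch:

```r
animals <- c("dog", "cat", "rat", "pig", "horse")

# Keep animals that are NOT in the reference set
animals[!(animals %in% c("cat", "dog"))]
```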
Subsetting data frames is very similar to subsetting vectors. The only difference is that data frames have 2 dimensions (rows and columns), while vectors have only 1 dimension. Thus, when you subset a data frame, you have to specify which dimension you want to use. For instance, if you execute vec[2], you will extract the 2nd element of the vector vec. However, if you pass the index 2 to a data frame, you can mean either the 2nd row or the 2nd column. To subset data frames, you will use a comma (,) inside square brackets to distinguish rows (which come before the comma) from columns (which come after the comma). For example:
# See the first 6 rows of the `airquality` data frame
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
# Extract the element in row 1, column 2
airquality[1, 2]
[1] 190
# Extract rows 1 to 5, column 1 - note: `1:5` is the same as `c(1, 2, 3, 4, 5)`
airquality[1:5, 1]
[1] 41 36 12 18 NA
To extract all rows or all columns, leave the corresponding field empty:
# Extract row 2, all columns
airquality[2, ]
  Ozone Solar.R Wind Temp Month Day
2    36     118    8   72     5   2
# Extract column 2, all rows
airquality[, 2]
[1] 190 118 149 313 NA NA 299 99 19 194 NA 256 290 274 65 334 307 78
[19] 322 44 8 320 25 92 66 266 NA 13 252 223 279 286 287 242 186 220
[37] 264 127 273 291 323 259 250 148 332 322 191 284 37 120 137 150 59 91
[55] 250 135 127 47 98 31 138 269 248 236 101 175 314 276 267 272 175 139
[73] 264 175 291 48 260 274 285 187 220 7 258 295 294 223 81 82 213 275
[91] 253 254 83 24 77 NA NA NA 255 229 207 222 137 192 273 157 64 71
[109] 51 115 244 190 259 36 255 212 238 215 153 203 225 237 188 167 197 183
[127] 189 95 92 252 220 230 259 236 259 238 24 112 237 224 27 238 201 238
[145] 14 139 49 20 193 145 191 131 223
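Note that extracting a single column this way returns a plain vector. If you need the result to remain a data frame, one option is the drop = FALSE argument of the square bracket operator. A brief sketch (object names are only illustrative):

```r
# Default behavior: a single column is simplified to a vector
solar_vec <- airquality[, 2]
is.data.frame(solar_vec)

# With `drop = FALSE`, the result stays a one-column data frame
solar_df <- airquality[, 2, drop = FALSE]
is.data.frame(solar_df)
dim(solar_df)
```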
You can also subset columns based on their names:
airquality$Month
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
[38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
[75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[149] 9 9 9 9 9
airquality[, "Month"] # same thing
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
[38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
[75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[149] 9 9 9 9 9
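A third equivalent form uses double square brackets, which is convenient when the column name is stored in another object. A minimal sketch (col_name is a hypothetical object):

```r
# Column name stored in an object
col_name <- "Month"

# `[[` accepts the object, whereas `$` needs the literal column name
month_a <- airquality[[col_name]]
month_b <- airquality$Month

# Both forms return the same vector
identical(month_a, month_b)
```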
As we did with vectors, we can subset the data frame using logical expressions:
# Extract rows for which column "Month" is equal to 5
airquality[airquality$Month == 5, ]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
21 1 8 9.7 59 5 21
22 11 320 16.6 73 5 22
23 4 25 9.7 61 5 23
24 32 92 12.0 61 5 24
25 NA 66 16.6 57 5 25
26 NA 266 14.9 58 5 26
27 NA NA 8.0 57 5 27
28 23 13 12.0 67 5 28
29 45 252 14.9 81 5 29
30 115 223 5.7 79 5 30
31 37 279 7.4 76 5 31
# Extract rows with "Temp" >90, then show only columns "Month" and "Day"
airquality[airquality$Temp > 90, c("Month", "Day")]
Month Day
42 6 11
43 6 12
69 7 8
70 7 9
75 7 14
102 8 10
120 8 28
121 8 29
122 8 30
123 8 31
124 9 1
125 9 2
126 9 3
127 9 4
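One caveat: when the column used in the logical expression contains missing values (NA), the comparison returns NA for those rows, and the subset then includes rows filled with NA. A common workaround is to wrap the test in which(), which keeps only the positions where the test is TRUE. A sketch using the Ozone column, which has missing values (object names are only illustrative):

```r
# Comparison with NA yields NA, so NA rows leak into the result
with_na <- airquality[airquality$Ozone > 100, c("Ozone", "Month", "Day")]
nrow(with_na)

# `which()` drops the NAs and keeps only rows where the test is TRUE
without_na <- airquality[which(airquality$Ozone > 100), c("Ozone", "Month", "Day")]
nrow(without_na)
```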
Subset the airquality data set to answer the following questions: