1 Introduction to R

The R programming language is a powerful tool for evolutionary biology research, offering a vast array of statistical and visualization tools that are essential for data analysis. With its extensive library of packages, R allows researchers to quickly and easily perform complex statistical analyses, including phylogenetic analyses, genome-wide association studies, and population genetics. R also provides a range of plotting and visualization functions that enable researchers to explore and present their data in a clear and concise manner. Additionally, R can be used in conjunction with other software tools, such as BEAST and IQ-TREE, to facilitate more complex evolutionary analyses. Due to its versatility and user-friendly interface, R has become an increasingly popular choice for evolutionary biologists looking to conduct high-quality research.

1.1 Goals of this lesson

This lesson was designed to be a gentle introduction to R for evolutionary genomics. Thus, we will only cover the basics of the language. An alternative name for this lesson could be “all you need to know about R to start your evolutionary analyses”. At the end of this lesson, you will be able to:

perform simple mathematical operations in R
create and modify objects
use functions and look for help
create vectors and understand the difference between vector types
create data frames
subset vectors and data frames

1.2 Creating objects

The simplest use of R is to perform mathematical operations, which can be simply typed in the console:

2 + 2

[1] 4

4 * 5

[1] 20

20 / 10

[1] 2

5 - 3

[1] 2

However, simple mathematical operations are not very useful. Typically, you would want to assign values to R objects. R objects can be created by using object_name + <- + value. The <- symbol is called the “assignment operator”, and it assigns values to objects. Example:

genome_size <- 2000000

Note that when you assign a value to an object, nothing happens on the screen. In the example above, we created an object named genome_size that contains the value 2000000, but the value is not printed to screen. To show the contents of an object, you must type its name and execute it. For example:

genome_size

[1] 2e+06

Alternatively, we can assign values to objects AND print them at the same time by surrounding the assignment code with parentheses. For example:

(genome_size <- 2000000)

[1] 2e+06

Naming objects

Although you can give your objects whatever name you want, some general best practices include:

Choose a descriptive name, but not too long.
Do not use dots (.).
Use nouns.
Be consistent with your styling. Styles include snake case (e.g., phylo_tree) and camel case (e.g., phyloTree.)

Note that object names cannot start with numbers.

Now that we have an object genome_size with a value, we can use the object to perform mathematical operations. For example, let’s suppose we want to see the genome size in many thousands of base pairs (i.e., kbp, which stands for kilobase pairs):

genome_size / 1000

[1] 2000

We can also change the value of an object by assigning a new value to it. For instance, let’s update the genome_size object with the size in kbp:

genome_size # original object

[1] 2e+06

genome_size <- genome_size / 1000 # modifying the object
genome_size

[1] 2000

As you can see, the object genome_size now contains the value 2000.

1.3 Functions and arguments

Functions are scripts that automate a particular task, and they typically have verbs in their names. They are made available to users as part of packages, and there are several R packages with specialized functions for each field. When you download and install R, it already comes with some packages installed, such as base and utils, but you can also install other packages depending on your needs.

A function takes one or many arguments as input and return something as output. To execute a function in R, you need to write the function name followed by the arguments inside parenthesis. For example, let’s execute the function sqrt(), which takes a numeric value as input and return its square root:

sqrt(100)

[1] 10

You can also pass objects as input to functions. For example, let’s pass our object genome_size to the sqrt() function:

sqrt(genome_size)

[1] 44.72136

To see all the possible arguments a function takes, use the function args() with the function name as input. For example, let’s see all possible arguments for the round() function (which rounds up numbers):

args(round)

function (x, digits = 0) 
NULL

We can see that the round() function can take two arguments as input: x, which is a numeric value, and digits, which indicates how many digits should be used when rounding values. Let’s try to execute the round() function with different options to digits.

pi <- 3.14159
round(pi)

[1] 3

round(pi, digits = 2)

[1] 3.14

round(pi, digits = 3)

[1] 3.142

Finally, if you want to learn more about what a function does and how to use it, you can type the function name preceded by a question mark. For example:

?round

This will open up a help page with detailed information on what the function does, its arguments, and examples.

Exercises

The genome of the plant Arabidopsis thaliana contains 33768 genes, of which 27655 are protein-coding and 6113 encode non-coding RNAs. Calculate the percentage of coding and non-coding genes in this genome using the following steps:

Create a variable named ath_genes that stores the numeric value 33768.
Create 2 objects named n_coding and n_noncoding that store the numeric values 27655 and 6113, respectively.
Create 2 objects named perc_coding and perc_noncoding by dividing n_coding and n_noncoding by ath_genes.
Update the objects by multiplying their contents by 100 (to represent in percentages, from 0 to 100) and rounding the percentages to 1 significant digit only.

Solution

# Store values in objects
ath_genes <- 33768
n_coding <- 27655
n_noncoding <- 6113

# Get percentages
perc_coding <- n_coding / ath_genes
perc_noncoding <- n_noncoding / ath_genes

# Update objects with actual percentages
perc_coding <- round(perc_coding * 100, 1)
perc_noncoding <- round(perc_noncoding * 100, 1)

# Show contents of the objects
perc_coding

[1] 81.9

perc_noncoding

[1] 18.1

1.4 Data structures

Now, we will explore two data structures in R that we will use all the time throughout the lessons (and in our research routine): vectors and data frames.

1.4.1 Vectors

Vector are the most common data structures in R, and they are used to store multiple values. There are 6 kinds of values that can be stored in a vector, but the ones we typically use are:

numeric: numbers.
character: text strings.
logical: either TRUE or FALSE.

The other types are integer, complex, and raw, but they are not important here.

To create a vector, you will use the c() function to combine values. For example:

# Creating a numeric vector
genome_sizes <- c(33768, 52872, 39756)
genome_sizes

[1] 33768 52872 39756

# Creating a character vector
species <- c("Arabidopsis", "soybean", "maize")
species

[1] "Arabidopsis" "soybean"     "maize"

You can inspect vectors with the functions:

class(): shows the class of a vector.
length(): shows the number of elements in a vector.
str(): summarizes the structure of a vector.

class(genome_sizes)

[1] "numeric"

length(genome_sizes)

[1] 3

str(genome_sizes)

 num [1:3] 33768 52872 39756

You can also combine two vectors:

species1 <- c("soybean", "maize")
species2 <- c("cotton", "bean")
combined_species <- c(species1, species2)
combined_species

[1] "soybean" "maize"   "cotton"  "bean"

NOTE: Vectors can only store values of the same type (i.e., character, numeric, etc). If you try to create a vector that contains values of different types, R converts them to all be the same type. This conversion of one class to another is called coercion. For example:

# Creating a vector with numeric and character values
c_vector <- c("a", "b", 1, "d")
c_vector

[1] "a" "b" "1" "d"

In the example above, R converted the numeric value 1 to a character.

1.4.2 Data frames

Data frames is the name R uses to call tables. To create a new data frame, you will use the data.frame() function. Each column of a data frame is a vector, so you can create a data frame by passing each vector to each column. For example, let’s recreate the genome_sizes and species vectors we created in the previous section and store them in columns genome_size and species of a data frame:

# Create vectors again
genome_sizes <- c(33768, 52872, 39756)
species <- c("Arabidopsis", "soybean", "maize")

# Create a data frame with columns `genome_size` and `species`
genome_size_df <- data.frame(
    genome_size = genome_sizes,
    species = species
)

genome_size_df

  genome_size     species
1       33768 Arabidopsis
2       52872     soybean
3       39756       maize

We can also create the vectors inside the data.frame() function itself:

genome_size_df <- data.frame(
    genome_size = c(33768, 52872, 39756),
    species = c("Arabidopsis", "soybean", "maize")
)

genome_size_df

  genome_size     species
1       33768 Arabidopsis
2       52872     soybean
3       39756       maize

To add a column to an existing data frame, you have to add $ followed by the new column name in front of the object name. For example, let’s add a column named is_model to the genome_size_df data frame that indicates whether or not a species is a model species:

# Add `is_model` column
genome_size_df$is_model <- c(TRUE, FALSE, FALSE)
genome_size_df

  genome_size     species is_model
1       33768 Arabidopsis     TRUE
2       52872     soybean    FALSE
3       39756       maize    FALSE

Finally, to inspect a data frame, you can use the following functions:

dim(): shows the dimensions of the data frame (i.e., number of rows and columns, respectively).
nrow(): shows the number of rows in a data frame.
ncol(): shows the number of columns in a data frame.
head(): shows the first 6 rows of a data frame.
tail(): shows the last 6 rows of a data frame.
names(): shows the column names.
rownames(): shows the row names
str(): summarizes the structure of a data frame.
summary(): shows summary statistics for each column.

Exercises

The plants Brussels sprout, grapevine and apple belong to the families Brassicaceae, Vitaceae, and Rosaceae. Create a data frame named species_families with 2 columns named species and family representing such information.

Solution

species_families <- data.frame(
    species = c("Brussels sprout", "grapevine", "apple"),
    family = c("Brassicaceae", "Vitaceae", "Rosaceae")
)

species_families

          species       family
1 Brussels sprout Brassicaceae
2       grapevine     Vitaceae
3           apple     Rosaceae

When you install R, it already comes with some example data sets. One of them is airquality, a data frame containing New York air quality measurements from May to September 1973. Inspect this data frame and answer the following questions:
- How many rows are there?
- How many columns are there?
- What are the column names?
- What are the classes of each column?

Solution

str(airquality)

'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

1.5 Indexing and subsetting

Here, you will learn how to extract specific elements of vectors and data frames, which is called subsetting.

1.5.1 Vectors

To subset vectors, you need to pass the index of the element you want to extract inside square brackets ([]). If you want to extract multiple elements, you have to pass a vector of indices inside the square brackets. For example:

plants <- c("rice", "maize", "duckweed")

# Extract 1st element
plants[1]

[1] "rice"

# Extract 1st and 2nd element
plants[c(1, 2)]

[1] "rice"  "maize"

# Extract 3rd and 2nd element (in this order)
plants[c(3, 2)]

[1] "duckweed" "maize"

You can also remove a given element by adding a minus (-) symbol before the index:

# Get all elements, except the 1st
plants[-1]

[1] "maize"    "duckweed"

# Get all elements, except the 1st and 3rd
plants[-c(1, 3)]

[1] "maize"

Another very common way to subset vectors is by using logical vectors. When using logical vectors, TRUE and FALSE will indicate whether to extract or not extract the element. For example, let’s use logical vectors to subset the plants vector we created above:

plants

[1] "rice"     "maize"    "duckweed"

# Extract 1st element, do not extract 2nd and 3rd elements
plants[c(TRUE, FALSE, FALSE)]

[1] "rice"

# Extract 1st and 3rd elements, do not extract 2nd
plants[c(TRUE, FALSE, TRUE)]

[1] "rice"     "duckweed"

Now, you might be thinking: using logical vectors seems complicated. Why would someone do this instead of using indices?. The answer is conditional subsetting. In R, you can use logical expressions that return TRUE or FALSE to extract elements. For example, let’s create a vector of genome sizes for fictional species and check if they more than 20,000 genes:

ngenes <- c(52000, 35000, 18000, 17000, 22000, 11000, 13000)

# Logical expression: is the element >= 20000?
ngenes > 20000

[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE

You can see that the logical expression returns a vector of TRUEs and FALSEs. Since TRUE means extract and FALSE means do not extract when subsetting, we can use logical expressions to subset vectors as follows:

# Extract elements with >=20000 genes
ngenes[ngenes > 20000]

[1] 52000 35000 22000

You can combine multiple tests in logical expressions using & (which means AND, both conditions are true) and | (which means OR, at least one of the conditions is true).

# Extract elements with number of genes between 20000 and 40000
ngenes[ngenes > 20000 & ngenes < 40000]

[1] 35000 22000

# Extract elements with 13000 or 11000 genes
ngenes[ngenes == 13000 | ngenes == 11000]

[1] 11000 13000

You can also subset a vector based on the presence of some pre-defined elements using the %in% operator. For example:

animals <- c("dog", "cat", "rat", "pig", "horse")

animals[animals %in% c("cat", "dog")]

[1] "dog" "cat"

1.5.2 Data frames

Subsetting data frames in very similar to subsetting vectors. The only difference is that data frames have 2 dimensions (rows and columns), while vectors have only 1 dimension. Thus, when you subset a data frame, you have to specify which dimension you want to use. For instance, if you execute vec[2], you will extract the 2nd element of the vector vec. However, if you pass the index 2 to a data frame, you can mean either the 2nd row or the second column. To subset data frames, you will use commas (,) inside square brackets to distinguish rows (which come before the comma) from columns (which come after the comma. For example:

# See the first 6 rows the `airquality` data frame
head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

# Extract the element in row 1, column 2
airquality[1, 2]

[1] 190

# Extract rows 1 to 5, column 1 - note: `1:5` is the same as `c(1, 2, 3, 4, 5)`
airquality[1:5, 1]

[1] 41 36 12 18 NA

To extract all rows or all columns, leave the corresponding field empty:

# Extract row 2, all columns
airquality[2, ]

  Ozone Solar.R Wind Temp Month Day
2    36     118    8   72     5   2

# Extract column 2, all rows
airquality[, 2]

  [1] 190 118 149 313  NA  NA 299  99  19 194  NA 256 290 274  65 334 307  78
 [19] 322  44   8 320  25  92  66 266  NA  13 252 223 279 286 287 242 186 220
 [37] 264 127 273 291 323 259 250 148 332 322 191 284  37 120 137 150  59  91
 [55] 250 135 127  47  98  31 138 269 248 236 101 175 314 276 267 272 175 139
 [73] 264 175 291  48 260 274 285 187 220   7 258 295 294 223  81  82 213 275
 [91] 253 254  83  24  77  NA  NA  NA 255 229 207 222 137 192 273 157  64  71
[109]  51 115 244 190 259  36 255 212 238 215 153 203 225 237 188 167 197 183
[127] 189  95  92 252 220 230 259 236 259 238  24 112 237 224  27 238 201 238
[145]  14 139  49  20 193 145 191 131 223

You can also subset columns based on their names:

airquality$Month

  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
 [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
 [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[149] 9 9 9 9 9

airquality[, "Month"] # same thing

  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
 [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
 [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[149] 9 9 9 9 9

As we did with vectors, we can subset the data frame using logical expressions:

# Extract rows for which column "Month" is equal to 5
airquality[airquality$Month == 5, ]

   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19
20    11      44  9.7   62     5  20
21     1       8  9.7   59     5  21
22    11     320 16.6   73     5  22
23     4      25  9.7   61     5  23
24    32      92 12.0   61     5  24
25    NA      66 16.6   57     5  25
26    NA     266 14.9   58     5  26
27    NA      NA  8.0   57     5  27
28    23      13 12.0   67     5  28
29    45     252 14.9   81     5  29
30   115     223  5.7   79     5  30
31    37     279  7.4   76     5  31

# Extract rows with "Temp" >90, then show only columns "Month" and "Day"
airquality[airquality$Temp > 90, c("Month", "Day")]

    Month Day
42      6  11
43      6  12
69      7   8
70      7   9
75      7  14
102     8  10
120     8  28
121     8  29
122     8  30
123     8  31
124     9   1
125     9   2
126     9   3
127     9   4

Exercises

Subset the airquality data set to answer the following questions:

How many days (rows) observations have ozone levels >40 ppb?
How many days have wind speed between 10 and 15 mph?
What are the minimum and maximum temperature?
How many days have solar radiation <100 Langleys?

Solution

# Q1: Ozone levels >40 ppb
nrow(airquality[airquality$Ozone > 40, ])

[1] 82

# Q2: Wind speed between 10 and 15 mph
nrow(airquality[airquality$Wind >= 10 & airquality$Wind <=15, ])

[1] 62

# Q3: Minimum and maximum temperature
summary(airquality)

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0

# Q4: Solar radiation <100 Langleys
nrow(airquality[airquality$Solar.R < 100, ])

[1] 41