3  Distance-based phylogeny inference

The simplest way to infer a phylogeny is to use pairwise distances between sequences, which are calculated from Multiple Sequence Alignments (MSA). Pairwise distances are typically stored in distance matrices, which are used to infer a phylogenetic tree using algorithms such as UPGMA or neighbor-joining. In this lesson, we will guide on you on how to infer a phylogeny from sequences.

3.1 Goals of this lesson

In this lesson, you will learn to:

  • perform MSA
  • calculate distances from MSA
  • infer phylogenetic trees from distance matrices

3.2 Multiple Sequence Alignment (MSA)

There are several algorithms to perform an MSA, such as ClustalW (Thompson, Higgins, and Gibson 1994), ClustalOmega (Sievers et al. 2011), MUSCLE (Edgar 2004), and MAFFT (Katoh and Standley 2013). Comparing them and deciding which one is the best is beyond the scope of this lesson. Here, we will perform MSA with the Bioconductor package msa (Bodenhofer et al. 2015), which provides native implementations of the ClustalW, ClustalOmega, and MUSCLE algorithms.

msa comes with an example FASTA file containing protein sequences of hemoglobin subunit alpha 1 in multiple species. Let’s start by reading the example file:

library(msa)
library(Biostrings)

# Path to example file
fasta_path <- system.file("examples", "HemoglobinAA.fasta", package="msa")
fasta_path
[1] "/home/faalm/R/x86_64-pc-linux-gnu-library/4.3/msa/examples/HemoglobinAA.fasta"
# Read example file
hemoglobin_seqs <- readAAStringSet(fasta_path)
hemoglobin_seqs
AAStringSet object of length 17:
     width seq                                              names               
 [1]   141 VLSPADKTNVKAAWGKVGAHAGE...AVHASLDKFLASVSTVLTSKYR HBA1_Homo_sapiens
 [2]   141 VLSPADKSNVKAAWGKVGGHAGE...AVHASLDKFLASVSTVLTSKYR HBA1_Macaca_mulatta
 [3]   141 MLTDAEKKEVTALWGKAAGHGEE...SAHAAMDKFLSKVATVLTSKYR HBA1_Ornithorhync...
 [4]   141 VLSAADKGNVKAAWGKVGGHAAE...AVHASLDKFLANVSTVLTSKYR HBA1_Bos_taurus
 [5]   141 VLSAADKTNVKAAWSKVGGNSGA...EIHASLDKFLALLSTVLTSKYR HBA1_Monodelphis_...
 ...   ... ...
[13]   141 VLSPADKTNVKGTWSKIGNHSAE...SVHASLDKFLASVSTVLTSKYR HBA1_Tursiops_tru...
[14]   141 HLTADDKKHIKAIWPSVAAHGDK...ATHKALDKFLVSVSNVLTSKYR HBA1_Xenopus_trop...
[15]   141 VLTEEDKARVRVAWVPVSKNAEL...EVLLSVDKFLGQISKVLASRYR HBA1_Microcephalo...
[16]   142 MVLSPADKTNVKAAWGKVGAHAG...AVHASLDKFLASVSTVLTSKYR HBA1_Pan_troglodytes
[17]   142 MVLSADDKTNIKNCWGKIGGHGG...AMHASLDKFLASVSTVLTSKYR HBA1_Rattus_norve...

Now, we will use the function msa() to perform an MSA using the default algorithm (ClustalW). You can also use other algorithms by changing the argument passed to the method parameter.

# Perform MSA
hemoglobin_aln <- msa(hemoglobin_seqs)
use default substitution matrix
hemoglobin_aln
CLUSTAL 2.1  

Call:
   msa(hemoglobin_seqs)

MsaAAMultipleAlignment with 17 rows and 143 columns
     aln                                                   names
 [1] -VLSPADKTNVKAAWGKVGAHAGEY...FTPAVHASLDKFLASVSTVLTSKYR HBA1_Homo_sapiens
 [2] MVLSPADKTNVKAAWGKVGAHAGEY...FTPAVHASLDKFLASVSTVLTSKYR HBA1_Pan_troglodytes
 [3] -VLSPADKSNVKAAWGKVGGHAGEY...FTPAVHASLDKFLASVSTVLTSKYR HBA1_Macaca_mulatta
 [4] -VLSAADKGNVKAAWGKVGGHAAEY...FTPAVHASLDKFLANVSTVLTSKYR HBA1_Bos_taurus
 [5] -VLSPADKTNVKGTWSKIGNHSAEY...FTPSVHASLDKFLASVSTVLTSKYR HBA1_Tursiops_tru...
 [6] -VLSGEDKSNIKAAWGKIGGHGAEY...FTPAVHASLDKFLASVSTVLTSKYR HBA1_Mus_musculus
 [7] MVLSADDKTNIKNCWGKIGGHGGEY...FTPAMHASLDKFLASVSTVLTSKYR HBA1_Rattus_norve...
 [8] -VLSATDKANVKTFWGKLGGHGGEY...FTPAVHASLDKFLATVATVLTSKYR HBA1_Erinaceus_eu...
 [9] -VLSAADKSNVKACWGKIGSHAGEY...FTPAVHASLDKFFSAVSTVLTSKYR HBA1_Felis_silves...
[10] -VLSPADKTNIKSTWDKIGGHAGDY...FTPAVHASLDKFFTAVSTVLTSKYR HBA1_Chrysocyon_b...
[11] -VLSDNDKTNVKATWSKVGDHASDY...FTPEVHASLDKFLSNVSTVLTSKYR HBA1_Loxodonta_af...
[12] -VLSAADKTNVKAAWSKVGGNSGAY...FTPEIHASLDKFLALLSTVLTSKYR HBA1_Monodelphis_...
[13] -MLTDAEKKEVTALWGKAAGHGEEY...FTPSAHAAMDKFLSKVATVLTSKYR HBA1_Ornithorhync...
[14] -VLSAADKNNVKGIFTKIAGHAEEY...LTPEVHASLDKFLCAVGTVLTAKYR HBA1_Gallus_gallus
[15] -HLTADDKKHIKAIWPSVAAHGDKY...FDPATHKALDKFLVSVSNVLTSKYR HBA1_Xenopus_trop...
[16] -VLTEEDKARVRVAWVPVSKNAELY...LKPEVLLSVDKFLGQISKVLASRYR HBA1_Microcephalo...
[17] -SLSDTDKAVVKAIWAKISPKADEI...FTPEVHVSVDKFFNNLALALSEKYR HBA1_Danio_rerio
 Con -VLS?ADK?NVKA?WGK?GGHA?EY...FTPAVHASLDKFLA?VSTVLTSKYR Consensus 

Now that we have an MSA, we can calculate distances.

3.3 Calculating distance matrices

Here, we will calculate distance matrices using 2 different methods:

  • the aastring2dist() function from MSA2dist, which calculates distances from AAStringSet objects using a score matrix. Here, we will use the Grantham matrix (Grantham 1974).

  • the dist.alignment() function from seqinr, which calculates distances from alignment objects using a similarity or identity matrix. Here, we will use the identity matrix.

First, as these 2 functions take different data classes as input, we will have to convert the hemoglobin_aln object to the right classes for each function.

# Convert `MsaAAMultipleAlignment` to `AAStringSet` for MSA2dist
aln_msa2dist <- as(hemoglobin_aln, "AAStringSet")

# Convert `MsaAAMultipleAlignment` to `alignment` for seqinr
aln_seqinr <- msaConvert(hemoglobin_aln, type = "seqinr::alignment")

Now, we can calculate the distance matrices.

library(MSA2dist)
library(seqinr)

# Calculate distance matrix with `MSA2dist::aastring2dist()`
dist_matrix1 <- aastring2dist(aa = aln_msa2dist, score = granthamMatrix())

Computing: [========================================] 100% (done)                         
dist_matrix1 <- as.dist(dist_matrix1$distSTRING)


# Calculate distance matrix with `seqinr::dist.alignment()`
dist_matrix2 <- dist.alignment(aln_seqinr, matrix = "identity")


# Inspect distance matrices
head(dist_matrix1)
[1]  0.000000  2.347518  9.035461  9.446809  8.304965 15.702128
head(dist_matrix2)
[1] 0.0000000 0.1684304 0.3472281 0.4038819 0.3766218 0.4688903
class(dist_matrix1)
[1] "dist"
class(dist_matrix2)
[1] "dist"

As you can see, distance matrices are stored in objects of class dist. This is the data class that we will use to infer phylogenetic trees.

3.4 Inferring trees

Now, we will use the bionjs() function from ape to infer neighbor-joining trees using the two distance matrices we calculated in the previous section.

library(ape)

# Infer neighbor-joining trees
tree1 <- bionjs(dist_matrix1)
tree2 <- bionjs(dist_matrix2)

# Plotting trees
plot(tree1)

plot(tree2)

The two methods lead to different tree topologies. How do you know which one is “true”, then? This is the topic of the next lesson, where you will learn how to use maximum likelihood to assess trees.

Exercises

Use the dist_matrix1 object to infer a phylogeny using the upgma() function from the phangorn package (Schliep 2011). Is the topology the same as the topology of tree1?

# Infer UPGMA tree and compare topologies
tree_upgma <- phangorn::upgma(dist_matrix1)
plot(tree_upgma)

References

Bodenhofer, Ulrich, Enrico Bonatesta, Christoph Horejš-Kainrath, and Sepp Hochreiter. 2015. “Msa: An r Package for Multiple Sequence Alignment.” Bioinformatics 31 (24): 3997–99.
Edgar, Robert C. 2004. “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput.” Nucleic Acids Research 32 (5): 1792–97.
Grantham, Richard. 1974. “Amino Acid Difference Formula to Help Explain Protein Evolution.” Science 185 (4154): 862–64.
Katoh, Kazutaka, and Daron M Standley. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30 (4): 772–80.
Schliep, K. P. 2011. “Phangorn: Phylogenetic Analysis in r.” Bioinformatics 27 (4): 592–93. https://doi.org/10.1093/bioinformatics/btq706.
Sievers, Fabian, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, et al. 2011. “Fast, Scalable Generation of High-Quality Protein Multiple Sequence Alignments Using Clustal Omega.” Molecular Systems Biology 7 (1): 539.
Thompson, Julie D, Desmond G Higgins, and Toby J Gibson. 1994. “CLUSTAL w: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.” Nucleic Acids Research 22 (22): 4673–80.