In this chapter, we will use the IQ-TREE (Minh et al. 2020) program to infer phylogenetic trees based on maximum-likelihood. You can download IQ-TREE here.
Let’s start by loading the required packages:
library(here)library(ape)set.seed(123) # for reproducibility
4.1 Goals of this lesson
To demonstrate how to use IQ-TREE, we will use an example data set to explore a question that used to be hotly debated years ago:
Are turtles more closely-related to birds or to crocodiles?
There are 3 hypotheses to test, and here we will use maximum-likelihood-based methods to find out which one is true:
In this lesson, you will learn to:
Infer a maximum likelihood tree
Apply partition models and choose the best partitioning schemes
Perform tree topology tests
Identify and remove influential genes
Calculate concordance factors
4.2 Data description
The data we will use in this lesson were obtained from Chiari et al. (2012), and they are stored in the data/ directory associated with this course. The files we will use are:
turtle.fa: a multiple sequence alignment (in FASTA) of a subset of the genes used in the original publication.
turtle.nex: a partition file (in NEXUS) defining a subset of 29 genes.
4.3 Inferring a maximum likelihood tree
To run IQ-TREE, use the command below:
iqtree2-s data/turtle.fa -B 1000 -T 2
Understanding the command-line arguments
The argument -s is mandatory, and this is where you indicate where the file containing the MSA is. In our previous command, iqtree2 -s data/turtle.fa means “run IQ-TREE using the MSA in the file data/turtle.fa”.
The other arguments and their meanings are:
-B: number of replicates for the ultrafast bootstrap (see Minh, Nguyen, and Von Haeseler (2013) for details). Here, we used 1000 replicates.
-T: number of CPU cores to use. Here, we’re using 2 cores, because IQ-TREE defaults to using all available cores.
For a complete list of arguments and the possible values they take, run iqtree2 -h.
The main IQ-TREE report will be stored in a file ending in .iqtree, and the maximum likelihood tree (in Newick format) will be stored in a file ending in .treefile.
Exercises
Look at the report file in data/turtle.fa.iqtree and answer the questions below. Hint: you can use the readLines() function to read the output to the R session.
What is the best-fit model name?
What are the AIC/AICc/BIC scores of this model and tree?
Visualise the tree in data/turtle.fa.treefile. What relationship among three trees does this tree support? What is the ultrafast bootstrap support (%) for the relevant clade?
In the figure below, you can see the tree published by Chiari et al. (2012). Does the inferred tree agree with the published tree?
Now, we will infer a ML tree applying a partition model (Chernomor, Von Haeseler, and Minh 2016), which means that each partition (specified in the turtle.nex file) will be allowed to have its own model.
The only new argument here is -p turtle.nex, which is used to specify an edge-linked proportional partition model, so that each partition can have shorter or longer tree length (i.e., slower or faster evolutionary rates, respectively).
As in the simpler IQ-TREE run, the main report is in a file ending in .nex.iqtree, and the tree is in a file named .nex.treefile.
Exercises
Look at the report file in data/turtle.nex.iqtree. What are the AIC/AICc/BIC scores of the partition model? Is it better than the previous model?
Visualize the tree in data/turtle.nex.treefile. What relationship among three trees does this tree support? What is the ultrafast bootstrap support (%) for the relevant clade?
Does the inferred tree agree with the published tree (Chiari et al. 2012)?
[1] "Best-fit model according to BIC: K2P+I+G4:ENSGALG00000000223.macse_DNA_gb,K2P+G4:ENSGALG00000001529.macse_DNA_gb,TN+F+G4:ENSGALG00000002002.macse_DNA_gb,TVM+F+I+G4:ENSGALG00000002514.macse_DNA_gb,TIM2e+I+G4:ENSGALG00000003337.macse_DNA_gb,TIM2+F+G4:ENSGALG00000003700.macse_DNA_gb,TN+F+G4:ENSGALG00000003702.macse_DNA_gb,K2P+I+G4:ENSGALG00000003907.macse_DNA_gb,TNe+I:ENSGALG00000005820.macse_DNA_gb,TIM2+F+G4:ENSGALG00000005834.macse_DNA_gb,TNe+G4:ENSGALG00000005902.macse_DNA_gb,TIM3e+G4:ENSGALG00000008338.macse_DNA_gb,TPM2+F+G4:ENSGALG00000008517.macse_DNA_gb,TPM2+F+I+G4:ENSGALG00000008916.macse_DNA_gb,TIM2+F+R3:ENSGALG00000009085.macse_DNA_gb,TN+F+G4:ENSGALG00000009879.macse_DNA_gb,TPM3+F+G4:ENSGALG00000011323.macse_DNA_gb,TIM3e+I+G4:ENSGALG00000011434.macse_DNA_gb,TIM3+F+G4:ENSGALG00000011917.macse_DNA_gb,K2P+G4:ENSGALG00000011966.macse_DNA_gb,SYM+G4:ENSGALG00000012244.macse_DNA_gb,TN+F+G4:ENSGALG00000012379.macse_DNA_gb,TNe+G4:ENSGALG00000012568.macse_DNA_gb,TPM2+F+G4:ENSGALG00000013227.macse_DNA_gb,TN+F+G4:ENSGALG00000014038.macse_DNA_gb,TN+F+G4:ENSGALG00000014648.macse_DNA_gb,TIM+F+G4:ENSGALG00000015326.macse_DNA_gb,TIM2+F+G4:ENSGALG00000015397.macse_DNA_gb,TPM2+F+I+G4:ENSGALG00000016241.macse_DNA_gb"
idx <-grep("List of best-fit models", part_report)idx <-seq(idx, idx +30)part_report[idx]
Besides the arguments we’ve already seen, the new arguments and their meanings are:
-m: specifies the model to use. Here, MFP+MERGE indicates running PartitionFinder followed by tree reconstruction.
-rcluster: to reduce computations by only examining the top n% (here, 10%) partitioning schemes using the relaxed clustering algorithm(Lanfear et al. 2014).
--prefix: specifies the prefix for all output files to avoid overwriting the output of previous runs.
The main report is a file ending in .merge.iqtree, and the tree is in a file ending in .merge.treefile.
Exercises
Look at the report file data/turtle.merge.iqtree and answer the questions:
How many partitions do we have now?
Look at the AIC/AICc/BIC scores. Compared with two previous models, is this model better or worse?
Visualize the tree in data/turtle.merge.treefile. What relationship among three trees does this tree support? What is the ultrafast bootstrap support (%) for the relevant clade?
Tree topology tests can be used to find out if different trees have significant differences in log-likelihoods. To do that, you can use the SH test (Shimodaira and Hasegawa 1999) or expected likelihood weights (Strimmer and Rambaut 2002).
Before running the tests, we first need to concatenate the trees in a single file:
# Read tree inferred with a single model as plain texttree_single <-readLines(here("data", "turtle.fa.treefile"))# Read tree inferred with partition models as plain texttree_partitioned <-readLines(here("data", "turtle.nex.treefile"))# Combine trees and export to filetree_combined <-c(tree_single, tree_partitioned)writeLines(tree_combined, con =here("data", "turtle.trees"))
Now, we can pass the concatenated tree to IQ-TREE:
Besides the arguments we’ve already seen, the new arguments and their meanings are:
-z: path to the file containing the concatenated trees.
-zb: number of replicates for approximate bootstrap for the tree topology tests (here, 10000).
-au: perform the Approximately Unbiased test.
-n 0: avoid tree search and just perform tree topology tests.
The main report is a file ending in .test.iqtree, and the tree is in a file ending in .test.treefile.
The KH, SH and AU tests return p-values. Thus, a tree is rejected if its p-value <0.05 (marked with a - sign).
Exercises
Look at the “USER TREES” section in the report file data/turtle.test.iqtree and answer the questions:
Which tree has the worst log-likelihood?
Can you reject this tree according to the Shimodaira Hasegawa test, assuming a p-value cutoff of 0.05?
Can you reject this tree according to the Approximately Unbiased test, assuming a p-value cutoff of 0.05?
Solution
# Read report and get the "USER TREES" sectiontest_report <-readLines(here("data", "turtle.test.iqtree"))idx <-grep("USER TREES", test_report)idx <-seq(idx, length(test_report))test_report[idx]
[1] "USER TREES"
[2] "----------"
[3] ""
[4] "See data/turtle.test.trees for trees with branch lengths."
[5] ""
[6] "Tree logL deltaL bp-RELL p-KH p-SH c-ELW p-AU"
[7] "-------------------------------------------------------------------------"
[8] " 1 -115741.0735 4.9507 0.424 + 0.426 + 0.426 + 0.424 + 0.412 + "
[9] " 2 -115736.1228 0 0.576 + 0.574 + 1 + 0.576 + 0.588 + "
[10] ""
[11] "deltaL : logL difference from the maximal logl in the set."
[12] "bp-RELL : bootstrap proportion using RELL method (Kishino et al. 1990)."
[13] "p-KH : p-value of one sided Kishino-Hasegawa test (1989)."
[14] "p-SH : p-value of Shimodaira-Hasegawa test (2000)."
[15] "c-ELW : Expected Likelihood Weight (Strimmer & Rambaut 2002)."
[16] "p-AU : p-value of approximately unbiased (AU) test (Shimodaira, 2002)."
[17] ""
[18] "Plus signs denote the 95% confidence sets."
[19] "Minus signs denote significant exclusion."
[20] "All tests performed 10000 resamplings using the RELL method."
[21] ""
[22] "TIME STAMP"
[23] "----------"
[24] ""
[25] "Date and time: Wed Apr 12 10:10:34 2023"
[26] "Total CPU time used: 6.99 seconds (0h:0m:6s)"
[27] "Total wall-clock time used: 17 seconds (0h:0m:17s)"
[28] ""
4.7 Concordance factors
In previous analyses, we were assuming that gene trees and species tree are the same. However, in empirical data, gene trees are often different from species trees, which we call discordance. Now, we will infer different gene trees for each partition separately, and then quantify the concordance between gene trees and species tree with concordance factors.
First, let’s infer one gene tree for each partition.
The code above calculates two kinds of concordance factors:
Gene concordance factor (gCF): percentage of decisive gene trees concordant with a particular branch of the species tree (0% <= gCF(b) <= 100%).
Site concordance factor (sCF): percentage of decisive alignment sites supporting a particular branch of the species tree (~33% <= sCF(b) <= 100%). sCF <33% means that another discordant branch b’ is more supported, whereas sCF=100% means that branch b is supported by all sites.
Exercises
Visualize the tree in data/turtle.nex.treefile.cf.tree. How do gCF and sCF look compared with bootstrap support values?
Challenge: read concordance factors stats in file data/turtle.nex.treefile.cf.stat and create a scatterplot using {ggplot2} showing gCF in the x-axis, sCF in the y-axis, and points colored by bootstrap support values. What do you conclude?
Chernomor, Olga, Arndt Von Haeseler, and Bui Quang Minh. 2016. “Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices.”Systematic Biology 65 (6): 997–1008.
Chiari, Ylenia, Vincent Cahais, Nicolas Galtier, and Frédéric Delsuc. 2012. “Phylogenomic Analyses Support the Position of Turtles as the Sister Group of Birds and Crocodiles (Archosauria).”Bmc Biology 10: 1–15.
Lanfear, Robert, Brett Calcott, Simon YW Ho, and Stephane Guindon. 2012. “PartitionFinder: Combined Selection of Partitioning Schemes and Substitution Models for Phylogenetic Analyses.”Molecular Biology and Evolution 29 (6): 1695–1701.
Lanfear, Robert, Brett Calcott, David Kainer, Christoph Mayer, and Alexandros Stamatakis. 2014. “Selecting Optimal Partitioning Schemes for Phylogenomic Datasets.”BMC Evolutionary Biology 14: 1–14.
Minh, Bui Quang, Minh Anh Thi Nguyen, and Arndt Von Haeseler. 2013. “Ultrafast Approximation for Phylogenetic Bootstrap.”Molecular Biology and Evolution 30 (5): 1188–95.
Minh, Bui Quang, Heiko A Schmidt, Olga Chernomor, Dominik Schrempf, Michael D Woodhams, Arndt Von Haeseler, and Robert Lanfear. 2020. “IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era.”Molecular Biology and Evolution 37 (5): 1530–34.
Shimodaira, Hidetoshi, and Masami Hasegawa. 1999. “Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference.”Molecular Biology and Evolution 16 (8): 1114.
Strimmer, Korbinian, and Andrew Rambaut. 2002. “Inferring Confidence Sets of Possibly Misspecified Gene Trees.”Proceedings of the Royal Society of London. Series B: Biological Sciences 269 (1487): 137–42.