The search for the maximum likelihood tree is an NP-hard problem. With RAxML, you usually conduct something like 100 tree searches in order to find a maximum likelihood estimate (MLE) tree. Depending on the shape of your likelihood surface, many of the searches will end up in different local optima. Usually, you will only consider the tree with the best likelihood. However, if you tested various partitioning schemes (e.g., an unpartitioned super-matrix, one partition per gene, some genes and some additional proteins, searches with or without individual branch length optimization), then the likelihoods of the resulting trees are not comparable across partitioning schemes.
However, it is straightforward to compare their RF-distances. If you concatenate all trees into a file (one tree per line) and run RAxML with -f r, then you obtain a triangular matrix of the topological distances between the trees (= RF-distances). Below, you see a heatmap visualization of the RF-distances of 80 trees. An unpartitioned super-matrix was used to infer 40 trees (red) and the other 40 trees are based on a partitioned dataset (blue). The heatmap.2 function of R clusters the topological distances, so that you can easily see which trees are very close to each other. I thought it would be nice to be able to inspect the likelihoods of the respective trees at the same time. So I replaced what is usually a dendrogram with a barplot that indicates the relative likelihoods of the trees. Relative means that the tree likelihood is divided by the average likelihood of all trees inferred under the same partitioning scheme. Smaller bars show a higher likelihood (log-likelihoods are negative, so a smaller ratio means a value closer to zero), and the per-partition MLE is marked with an additional red bar below the likelihood. But still: bars with different colours are not comparable with each other.
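To make the definition of "relative" concrete, here is a minimal sketch of the normalization described above. It is written in Python rather than R, and it is not the code behind the plot: the function names and the example log-likelihoods are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def relative_likelihoods(lnls, schemes):
    """Divide each log-likelihood by the mean log-likelihood of its scheme."""
    groups = defaultdict(list)
    for lnl, scheme in zip(lnls, schemes):
        groups[scheme].append(lnl)
    means = {s: sum(values) / len(values) for s, values in groups.items()}
    return [lnl / means[s] for lnl, s in zip(lnls, schemes)]

def best_per_scheme(lnls, schemes):
    """Return, per scheme, the index of the tree with the highest log-likelihood."""
    best = {}
    for i, (lnl, s) in enumerate(zip(lnls, schemes)):
        if s not in best or lnl > lnls[best[s]]:
            best[s] = i
    return best

# Hypothetical log-likelihoods for five trees under two partitioning schemes.
lnls = [-1000.0, -1010.0, -990.0, -2000.0, -2020.0]
schemes = ["unpartitioned"] * 3 + ["partitioned"] * 2

rel = relative_likelihoods(lnls, schemes)
best = best_per_scheme(lnls, schemes)

for i, (r, s) in enumerate(zip(rel, schemes)):
    marker = " <- best of its scheme" if best[s] == i else ""
    print(f"tree {i} ({s}): relative likelihood {r:.4f}{marker}")
```

Within one scheme the tree with the smallest ratio is the one with the best likelihood. The ratios of different schemes are each centred on 1, which is why they can share one barplot but still must not be compared across colours.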
Okay, so what is the recipe to quickly recreate the plot? Download the modified heatmap.2 code and source it in your script. The signature of the plotting function is rfDistancesWithLikelihood(rfDistFile, lnlFile, lnlCol, catCol).
rfDistFile is the RAxML_RF-Distances.runId file as produced by “RAxML -f r”. lnlFile contains the likelihoods of the trees. With lnlCol you specify the column that contains the likelihoods, and catCol is the column that assigns each tree to its partitioning scheme. The zip-archive contains example files. Important: when you call RAxML to produce the RF-distances, the order of the trees must be the same as in the lnlFile.
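If you want to look at the distances outside of the plotting function, the pairwise file can be turned into a full symmetric matrix. The following Python sketch assumes that every line has the form "i j: RF relativeRF", with tree indices starting at 0 in the order of the concatenated tree file. Please check this against your own output file. The function name and the sample values are hypothetical.

```python
def parse_rf_distances(lines):
    """Build a symmetric RF-distance matrix from lines like '0 1: 4 0.25'.

    Assumed line format: '<tree i> <tree j>: <RF> <relative RF>'.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pair, values = line.split(":")
        i, j = (int(x) for x in pair.split())
        rf = float(values.split()[0])  # first value: absolute RF-distance
        entries.append((i, j, rf))

    n = max(max(i, j) for i, j, _ in entries) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, j, rf in entries:
        matrix[i][j] = rf
        matrix[j][i] = rf
    return matrix

# Hypothetical content of an RF-distance file for three trees.
sample = """0 1: 4 0.25
0 2: 8 0.50
1 2: 6 0.375
"""

matrix = parse_rf_distances(sample.splitlines())
for row in matrix:
    print(row)
```

Row k of the matrix then belongs to the k-th tree in the concatenated file, so it has to line up with row k of the lnlFile. Comparing the matrix size with the number of likelihood rows is a cheap sanity check.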
A few things we see in this specific heatmap: for the partitioned analysis, there is a local optimum that is topologically distant from the MLE tree for this partitioning scheme, but not much worse in terms of likelihood. For the searches on the unpartitioned dataset, it seems to make a big difference whether we search under GTRGAMMA or GTRCAT.
In general, be careful with the interpretation of the clustering: it depends on the clustering method (use the argument “clustmethod” to switch between single, complete or average). Also, there are caveats to the RF-distance: a single rogue taxon that assumes distant positions in two compared trees can lead to extremely high RF-distances, even if the trees are topologically identical without the rogue.
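The rogue-taxon caveat is easy to reproduce with a toy example. The Python sketch below is not how RAxML computes the distance. It represents trees as nested tuples (a representation chosen here purely for illustration), collects the non-trivial bipartitions of each tree, and counts the bipartitions found in only one of the two trees. The taxon names and helper functions are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon names below a node (a string is a leaf)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Collect the non-trivial bipartitions of a tree given as nested tuples."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)  # each split is stored as the side without this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_taxa - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_taxa) - 2:  # skip trivial splits
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))

def prune(tree, taxon):
    """Remove one taxon and collapse nodes left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [p for p in (prune(child, taxon) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

# Two caterpillar trees that differ only in the position of the rogue taxon "R".
tree1 = ("R", ("A", ("B", ("C", ("D", ("E", ("F", "G")))))))
tree2 = ("A", ("B", ("C", ("D", ("E", ("F", ("G", "R")))))))

rf_with_rogue = rf_distance(tree1, tree2)
rf_without_rogue = rf_distance(prune(tree1, "R"), prune(tree2, "R"))
print(rf_with_rogue, rf_without_rogue)
```

With the rogue at opposite ends of the caterpillar, the two trees share no bipartition at all, so the distance takes the largest value possible for eight taxa. After pruning the rogue, the trees are identical and the distance drops to zero.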
Tags: maximum likelihood tree, phylogenetics, RAxML