Inferring a Level-1 network with the NANUQ+ routines

library(MSCquartets)
#> Loading required package: ape
#> Loading required package: phangorn

Why use NANUQ+ routines?

NANUQ+ is a collection of routines for inference of level-1 network features under the network multispecies coalescent (NMSC) model. Notably, they do not assume that the entire network is level-1. Taking as input a collection of unrooted topological gene trees, output is an unrooted (semidirected) topological level-1 network that captures species relationships, displaying cycles and cut edges (tree-like network features). Complex blobs (those that are not cycles) may be left as multifurcations of degree greater than 3 (polytomies) in this structure.

The initial step infers the unknown network’s tree of blobs, using the function TINNIK (described in another vignette). This tree partially represents species relationships, showing only cut edges and contracting all reticulation cycles to multifurcations. Multifurcations suspected to arise from cycles can then be resolved into optimal cycle structures using resolveCycle. This resolution relies on a least-squares approach, comparing an empirical NANUQ distance among taxon groups around a multifurcation to expected distances for possible cycle configurations. For cycles under a default size of 10, a full search is fast, while larger cycles are processed using a quartet-based heuristic.

The user may combine cycle resolutions for some or all of the individual multifurcations with combineCycle. Blobs that are not suspected to arise from cycles can be left unresolved. When all multifurcations are believed to be cycle-derived, resolveLevel1 both resolves all multifurcations and combines the resolutions into fully-resolved level-1 networks.

We recommend using the NANUQ function before resolving any multifurcations, to assess which ones may have resulted from network cycles. For a level-1 network, NANUQ generates a splits graph with distinctive structural features called ‘darts’. If the level-1 assumption is violated, this structural pattern is disrupted, providing insight into whether resolving blobs as cycles is warranted.

For blobs that are cycles, all procedures here provide statistically consistent estimates of network structure under the NMSC model.

Preparing the input data

The NANUQ+ routines take as input a collection of gene trees for a set of taxa. From multilocus sequence data, standard phylogenetic tools must first be used to infer unrooted topological trees for each gene. Although these trees are not raw data, they are treated as such. If the gene trees are rooted or contain edge lengths, that information is simply ignored in these routines.

We work with an example data set of 16,338 gene trees of 16 Leopardus species, supplied by Lescroart et al. (2023) and analyzed with NANUQ+ by Allman, Baños, Rhodes, et al. (2024). Gene trees in Newick format can be read from a text file and their displayed quartet trees counted with commands such as:

# read text file of gene trees and count quartets on them

gts<-read.tree(file = 'genetreefile')
tableLeopardusLescroart=quartetTable(gts)

We will use a table of this data supplied within MSCquartets, pre-computed by the above commands. Both TINNIK and NANUQ can take as input either the gene tree file or the quartet table.

# load data file containing quartet counts for Leopardus data set supplied with MSCquartets package
# These counts are will be accessed as `tableLeopardusLescroart`.

data(tableLeopardusLescroart)

Notes:

All functions used here allow gene trees with missing taxa, provided each subset of four taxa appears on at least one tree (ideally, on many). Statistical tests are conducted for each of these subsets, with the sample size being the number of trees displaying the subset.

The good statistical behavior of the analysis below is established under the assumption that the input gene trees are a true sample from the NMSC model. In practice, inferred gene trees, containing some error, are used instead. Poorly inferred gene trees or those lacking sufficient resolution can compromise results.

Step 1: Inferring the tree of blobs

The function TINNIK can be used to infer the tree of blobs for an unknown network. This first applies two hypothesis tests for each choice of 4 taxa, testing whether the gene tree counts allow for the rejection of a resolved quartet tree or a star tree for those species. (See TINNIK’s vignette or (Allman, Baños, Mitchell, et al. 2024) for more information.)

# perform TINNIK analysis for gene trees, using defaults

output<-TINNIK(tableLeopardusLescroart)
#> Applying hypothesis test for model T3 to 1820 quartets.
#> Applying hypothesis test for star tree model to 1820 quartets.

# save table of quartet information with p-values

pT<-output$pTable

The default TINNIK test levels are α = .05 for the null hypothesis “the quartet has a tree-like relationship”, and β = .95 for the null hypothesis “the quartet has a star-like relationship”. An initial run of TINNIK with default test levels is rarely a sufficient analysis. One should always vary the α and β to judge robustness of the inferred tree of blobs to their choice.

As detailed by Allman, Baños, Rhodes, et al. (2024), for this data we choose α = 5e − 29, imposing a stringent test for judging non-tree-likeness. Note in the first plot below this gives a clear separation between red (quartets interpreted as 4-blobs) and blue (quartets interpreted as tree-like) symbols, allowing for noise in the gene trees without rejecting tree-like-ness too strongly.

For this dataset and test levels we obtained a tree of blobs with two multifurcations, both of degree five.

# perform improved TINNIK analysis to infer the tree of blobs

output<-TINNIK(pT, alpha=5e-29,beta = 0.95)

Output produced by TINNIK includes the table of quartet counts and associated test p-values and the tree of blobs. We rename these to use as input in further functions.

# run TINNIK to infer the tree of blobs

output<-TINNIK(pT, alpha=5e-29,beta = 0.95)

# rename output 

pT<-output$pTable #quartet count data with p-values for tests
ToB<-output$ToB   #the TINNIK tree of blobs

Using NANUQ to explore whether multifurcations should be resolved into cycles

The NANUQ function can help detect whether a blob might be a cycle. On it’s own, NANUQ is a statistically consistent level-1 network inference algorithm under the NMSC model. It takes the same input as TINNIK (a set of gene trees or a quartet table, along with two test levels) and produces, among other outputs, a distance measure that can be used as input to the neighborNet algorithm (as implemented in the phangorn package) to generate a splits network with specific structural properties. When the model is violated, this structure is disrupted, providing insight into potential violations of the level-1 assumption.

Further details on NANUQ and its applications are given by Allman, Baños, and Rhodes (2019) and Rhodes et al. (2020). We recommend using the same test levels but encourage testing alternative levels as well.

# perform NANUQ analysis for table of quartet information and p-values

D<-NANUQ(pT, alpha = 5e-29,beta = 0.95) # Run the NANUQ routine
#> Distance table written to file: NANUQdist_alpha5e-29_beta0.95.nex
NN<-neighborNet(D$dist) # Run the NeighborNet algorithm on the NANUQ distance
plot(NN) # plot the splits-graph with neighborNet

As in (Allman, Baños, Rhodes, et al. 2024), we identified one multifurcation that aligns well with the model hypothesis, displaying a characteristic dart-like structure, and another that does not. We cannot determine whether the latter multifurcation results from a model violation (presumably level-1-ness) or simply noise. Nevertheless, this indicates caution is needed in interpreting findings associated with this multifurcation.

Note:

We used the NeighborNet implementation neighborNet in the phangorn package here, to do the full analysis within R. However, we recommend using SplitsTree (Huson and Bryant 2006) for the splits graph, as it has been more thoroughly tested. For more details, see (Rhodes et al. 2020).

Step 2: Resolving multifurcations

When resolving the multifurcations, the first step is to numerically label all internal nodes in the tree of blobs, so that we and the code can refer to individual multifurcations.

  #Label internal nodes of the tree of blobs, and plot

  ToB<-labelIntNodes(ToB)

Here the multifurcations are nodes 18 and 20. We will resolve each one independently using the function resolveCycle. While the same test levels as in TINNIK can be used, we suggest experimenting with different levels.

  # resolve node 18

  resC18<-resolveCycle(ToB,18,pT,alpha=5e-29,beta=0.95)

The output resC18 is a list that includes a network displaying the best resolution of node 18, with the other multifurcation remaining unresolved. Other information in this list is needed for combining resolutions of different multifurcations, or producing the histogram of residuals for all possible resolutions.

  # resolve node 2O

  resC20<-resolveCycle(ToB,20,pT,alpha=5e-29,beta=0.95)
#> For node 20, 5 resolutions found.

Note multifurcation 20 produced five optimal resolutions (that is, those 5 resolutions have residual sums of squares that are within a small tolerance). This suggests that this multifurcation may not arise from a cycle, or that the data has too much noise to be sure.

Alternatively, if one suspects that all multifurcations arise from a cycle, one could attempt to resolve all multifurcations to obtain a level-1 network.

#Fully resolve the tree of blobs to a level-1 network

resN<-resolveLevel1(ToB=output$ToB, pTable=output$pTable, alpha=5e-29, beta=0.95, 
                    distance="NANUQ")
#> 2 multifurcations on tree, at nodes 18, 20.
#> For node 20, 5 resolutions found.

The output resN includes, as its first element, a list of the 5 level-1 networks that accommodate all the optimal resolutions that can co-occur.

Note:

In reporting any NANUQ+ analysis, it is essential to report the test levels used, and, as mentioned throughout, we strongly recommend exploring with values beyond the defaults.

Estimating numerical parameters and further comparing multiple optimal resolutions

NANUQ+ yielded 5 fully-resolved topological networks for this data analysis.

One can use the Julia package PhyloNetworks (Solı́s-Lemus, Bastide, and Ané 2017) to optimize numerical parameters (edge lengths and hybridization parameters) on these under a quartet pseudo-likelihood criterion, and also use the pseudo-likelihood score to further compare these networks. Note, however, that comparing networks by psuedo-likelihood is a compromise due to the difficulty of computing full likelihoods.

The commands below (based on 2024 PhyloNetworks) provide an example of this. Note that this is Julia code and not R code, and it requires additional installation of software.

using PhyloNetworks #Load package
gts = readMultiTopology("genetreefile") #load gene trees to Julia
q, t = countquartetsintrees(gts; showprogressbar = true) #compute the quartet counts
global df = writeTableCF(q, t) # to get a DataFrame that can be saved to a file later
global empCFs = readTableCF(df)
ntwk = readMultiTopology("Network.nwk", false) # read the topology of the newtwok to be optimized
net = topologyMaxQPseudolik!(ntwk,empCFs) # optimize network parameters to obtain the pseudo-likelihood score

Note:

Inferring numerical parameters is only suggested for networks that are fully resolved, such as those from the function resolveLevel1. Attempted optimization on networks that are partially resolved, where some multifurcations represent unknown blob structures, may lead to erroneous inference.

References

Allman, E. S., H. Baños, J. D. Mitchell, and J. A. Rhodes. 2024. “TINNiK: Inference of the Tree of Blobs of a Species Network Under the Coalescent.” bioRxiv. https://doi.org/10.1101/2024.04.20.590418.
Allman, E. S., H. Baños, and J. A. Rhodes. 2019. NANUQ: A Method for Inferring Species Networks from Gene Trees Under the Coalescent Model.” Algorithms Mol. Biol. 14 (24): 1–25. https://doi.org/10.1186/s13015-019-0159-2.
Allman, E. S., H. Baños, J. A. Rhodes, and K. Wicke. 2024. NANUQ+: A Divide-and-Conquer Approach to Network Estimation.”
Huson, D. H., and D. Bryant. 2006. “Application of Phylogenetic Networks in Evolutionary Studies.” Molecular Biology and Evolution 23 (2): 254–67.
Lescroart, Jonas, Alejandra Bonilla-Sánchez, Constanza Napolitano, Diana L Buitrago-Torres, Héctor E Ramı́rez-Chaves, Paola Pulido-Santacruz, William J Murphy, Hannes Svardal, and Eduardo Eizirik. 2023. Extensive Phylogenomic Discordance and the Complex Evolutionary History of the Neotropical Cat Genus Leopardus.” Molecular Biology and Evolution 40 (12): msad255. https://doi.org/10.1093/molbev/msad255.
Rhodes, J. A., H. Baños, J. D. Mitchell, and E. S. Allman. 2020. MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R.” Bioinformatics, October. https://doi.org/10.1093/bioinformatics/btaa868.
Solı́s-Lemus, Claudia, Paul Bastide, and Cécile Ané. 2017. PhyloNetworks: A Package for Phylogenetic Networks.” Molecular Biology and Evolution 34 (12): 3292–98. https://doi.org/10.1093/molbev/msx235.