TINNIK
infers a “tree of blobs” under the network
multispecies coalescent model. The resulting unrooted topological tree
partially depicts the species relationships that led to a collection of
gene trees, showing only the cut edges of the network (ones which join
otherwise unconnected pieces of the network). Any cycles or more
complicated blobs formed by reticulations in the network are contracted
to multifurcations (polytomies) in this tree.
The tree of blobs thus isolates those parts of the network where
reticulations have made relationships non-tree-like. A researcher might
then apply other methods to investigate the structure of each blob,
perhaps by reducing to consideration of a smaller number of taxa.
However, with current methods inferring complicated blob structure
accurately may be difficult or impossible. Since the TINNIK
algorithm is statistically consistent under the NMSC model regardless of
the unknown blob structure, it may provide the strongest network
inference possible without making assumptions on the unknown network
structure.
Many current methods of network inference are statistically
consistent only if the unknown network is level-1 (each blob is as
simple as possible, containing only a single reticulation). They can
only return a level-1 network as output, and may give no indication as
to whether a level-1 network is an adequate model for the data. When a
level-1 method is used for network inference, checking that
TINNIK
’s results are consistent with the level-1 output can
provide a researcher some justification for that assumption.
TINNIK
requires input in the form of a collection of
gene trees on a common collection of taxa. Thus from multilocus sequence
data it is necessary to first align each gene’s sequences and use
standard phylogenetic methods and software (e.g., IQtree, RAxML) to
obtain an unrooted topological tree for each gene. While these trees are
not strictly data, as they are themselves inferred, TINNIK
treats them as such. If the gene trees are rooted, or have edge lengths,
that information is simply ignored.
We work with an example data set of 1730 gene trees of 7 Papionini
species, extracted from Vanderpool et al.
(2020) and analyzed by TINNIK
in Allman et al. (2024). Gene trees in Newick
format are read from a text file to a multiphylo object using an
ape function:
# read text file of gene trees supplied with MSCquartets package
gts=read.tree(file = system.file("extdata","dataPapioniniVanderpool",package="MSCquartets"))
While some missing taxa on gene trees can be handled by
TINNIK
, it is necessary that each subset of 4 taxa appears
on at least 1 gene tree and desirable that it appears on many.
Statistical tests are performed for each such set, and the amount of
data for these is determined by the count of trees displaying the
set.
All understanding of TINNIK
’s statistical behavior is
established by assuming that the input gene trees are a true sample
under the NMSC model. In practice, inferred gene trees are used, which
likely contain some error. In particular, as with other network
inference methods using inferred gene trees as input, widespread lack of
resolution in the gene trees is unlikely to lead to a good analysis,
The first steps of TINNIK
are to count occurrences of
each quartet tree topology across all gene trees, for each set of 4
taxa, and then to apply two hypothesis tests to these counts to judge
their fit to 4-taxon star and resolved species trees, producing two
p-values, “p_star” and “p_T3”
The easiest way to do this is to run the commands below, where we
save the table of quartet counts and associated p-values as “pT”. The
TINNIK
command here does a full analysis, which we
generally should not take as our final one, as we are using default
nominal levels for the hypothesis tests.
# perform initial TINNIK analysis for gene trees, using defaults
output=TINNIK(gts)
#> Analyzing 7 taxa: Cercocebus_atys, Macaca_fascicularis, Macaca_mulatta, Macaca_nemestrina, Mandrillus_leucophaeus, Papio_anubis, Theropithecus_gelada
#> Counting occurrences of displayed quartets for 35 four-taxon subsets of 7 taxa across 1730 gene trees.
#> Warning in quartetTable(genetrees, taxanames, epsilon = epsilon): Some taxa
#> missing from some trees.
#> Applying hypothesis test for model T3 to 35 quartets.
#> Applying hypothesis test for star tree model to 35 quartets.
# save table of quartet information and p-values
pT=output$pTable
The TINNIK
command produced three plots. The first is a
simplex plot of the hypothesis test results, at the specified levels.
Each set of 4 taxa is represented by one plotted symbol, with those near
the 3 lines indicating a tree-like relationship, and those farther from
the lines a putative non-tree-like one. Those near the centroid of the
simplex indicate a star-like relationship. (See Allman, Mitchell, and Rhodes (2021) for a more
complete explanation of these plots.)
The second simplex plot shows the outcome of the TINNIK
inference rule, where some of the quartets initially viewed as tree-like
(T-quartets) from the hypothesis test are subsequently flagged as coming
off a blob (B-quartets). (See Allman et al.
(2022) or Allman et al. (2024) for
a more complete explanation.)
The final plot shows TINNIK
’s inferred tree of blobs,
with 5- and 4- multifurcations joined by an edge, indicating 4- and
5-blobs. Note this result is dependent on the levels of the hypothesis
tests shown in the subtitle of the plot. The default values are α = .05 for the test with null
hypothesis “the quartet has a tree-like relationship”, and β = .95 for the test with null
hypothesis “the quartet has a star-like relationship”.
Since this data set is relatively small, it does not take long to produce the table of quartet counts and p-values, and we could simply recalculate it for additional runs. For a large data set (many taxa, many gene trees) producing this table is the most significant factor in run-time, so saving it for reuse is wise.
The table “pT” can be printed directly in R, or printed in a nicer
format with quartetTablePrint
. However, its rows should not
be reordered, as the TINNIK
algorithm uses an indexing
function to access them quickly.
Instead of the T3 hypothesis test for tree-likeness, the analysis can
be run using the cut test (see the TINNIK
documentation).
Doing so may produce fewer initial B-quartets, and thus a more resolved
tree of blobs, but is more appropriate only for what might be considered
extreme network structures (see Allman et al.
(2024)).
An initial run of TINNIK
with default test levels is
rarely a sufficient analysis. One should always vary the test levels
α and β to judge robustness of the
inferred tree of blobs to their choice.
We generally have little understanding of the error that might be in the gene trees, and making different choices of the levels can be used to address this somewhat. Setting α = 0 will lead to all quartets being judged as tree-like, and increasing α will potentially increase the number considered to have putative 4-blobs. In a noisy data set, the default α = .05 may result in many putative 4-blobs, and a smaller value of α may be needed for a useful inference. Since hypothesis testing is done on each set of 4 taxa independently, even with no gene tree error merely having more taxa present makes the erroneous judgment of some putative 4-blobs more likely.
Similarly, a value of β = 1 will result in no quartet relationship being judged as star-like, but decreasing β may result in more star-like quartets.
Smaller αs and/or larger βs tend to produce more resolution in the inferred tree of blobs.
For our example data set, we see no symbols plotted near the centroid in the first simplex plot, so only values of β very close to 0 are likely to result in any judgment of star trees. An examination of “pT” confirms this, as the maximum in the “p_star” column is ≈ .112e − 32. To see the effect of a tiny β, we reuse the tabulated quartet information in “pT” and enter:
TINNIK(pT, alpha=.05, beta=1e-40)
#> Some points (green) not rejected as star, but rejected as tree.
Here the first simplex plot shows a few symbols where the star-tree
is now not rejected. However, as these are treated by the
TINNIK
algorithm as B-quartets, we see no change in the
next simplex plot or the inferred tree of blobs from our initial
TINNIK
run.
Since the original simplex plot shows rejection of tree-likeness for some quartets plotted near the line segments, varying α is likely to be more interesting. Using a smaller α, to potentially increase the number of quartets which are judged as tree-like, we enter:
The first simplex plot indeed shows fewer red triangles (initial B-quartets) and more blue circles, while the second plot has fewer gold squares (all B-quartets) and more green “x”s, producing an inferred tree of blobs with more resolution — in fact, a fully-resolved tree.
This leads us to try a choice of α intermediate to the previous ones, so we enter:
Here the simplex plots and the tree of blobs vary from either of the earlier ones, showing one of the multifurcations in our initial analysis, but resolving the other. Further varying α can give intervals over which these analyses are stable.
In reporting any TINNIK
analysis it is essential to
report the test levels used, and we strongly recommend exploring with
values beyond the defaults.