The Ortho2tree analysis pipeline seeks to provide consistent sets of canonical protein sequences for organisms that have diverged over the last 100 – 250 million years. In many eukaryotes, an individual gene can produce multiple proteins from different transcript isoforms (different selections of protein-coding exons). The UniProt protein databases select one of those isoforms for inclusion in a reference proteome set; other isoforms are available, but the default, or "canonical", isoform is the one that is most heavily searched and annotated. Historically, the canonical isoform for unreviewed protein sequences (sequences in the TrEMBL sequence set) was selected using a simple rule: the longest isoform was selected as canonical. This led to some biologically implausible results; for example, mouse and rat sequences for the same enzyme that differ dramatically in length.
For example, mouse, human, chimp, and cow eukaryotic translation
initiation factor 3 subunit L are 564 amino acids long, and all share
more than 99% identity from beginning to end. In contrast, in UniProt
release UP2022_05, the rat canonical initiation factor
(A0A8I5ZLD0_RAT) was 621 residues long. While it also shared
99.6% identity to the mouse factor, the alignment was incomplete, covering only residues 13-564 of
the mouse sequence, and 70-621 of the rat. An isoform
of A0A8I5ZLD0_RAT, G3V7G9_RAT, is exactly the same
length as the other mammalian proteins, and aligns from beginning to
end (99.5% identity). Thus, G3V7G9_RAT became the
canonical isoform with release UP2023_01, based on the ortho2tree
analysis:
Evolutionary tree and multiple alignment used to select an alternative canonical isoform. The evolutionary tree on the left was constructed using sequence distances counting only gaps; substitutions do not contribute to the calculated distance. The multiple sequence alignment on the right shows the location of sequence (blue) and gaps (grey). In this example, the clade containing the canonical human, chimp (PANTR), cow (BOVIN), mouse and rat sequences has a cost of 0.0913, while the clade with the proposed canonical G3V7G9_RAT has a cost of zero. The rat canonical in UP2022_05 is highlighted in red. The proposed rat canonical G3V7G9 is highlighted in green. Mouse, cow, and chimp canonicals that belong to the low-cost clade are highlighted in blue. The human canonical is highlighted in dark blue because it is also the MANE selection. Sequence names in black are canonicals that were not part of the clade used to propose a change; sequence names in grey are other isoforms. Numbers in brackets after the sequence name indicate the length of the sequence, and the clade (≥1) that was used to suggest a change.
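
The gap-only distance described in the caption can be illustrated with a minimal Python sketch. The function name `gap_distance` and the normalization (fraction of informative alignment columns in which exactly one of the two sequences has a gap) are assumptions made for illustration; the exact formula used by the pipeline is not given here.

```python
def gap_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of aligned columns where exactly one sequence has a gap.

    Assumption: this normalization is illustrative only; substitutions
    are ignored, as in the gap-distance trees described above.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where both rows are gaps: they carry no information
    # about this particular pair of sequences.
    columns = [(a, b) for a, b in zip(seq_a, seq_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    # Count columns where one row has a residue and the other a gap.
    gapped = sum((a == gap) != (b == gap) for a, b in columns)
    return gapped / len(columns)
```

With this definition, two rows that differ only by substitutions are at distance zero, which is why a clade of full-length orthologs can have a cost of zero even when the sequences are not identical.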
| Date | UP release | results | Notes |
|---|---|---|---|
| 2022 Sep | | | Ortho2tree poster at QfO workshop in Spain (poster) |
| 2023 Feb | 2023_01 | results | First integration into UP genecentric to select canonicals of UP2023_01 (public 2023-Feb-22) based on analysis "qfomam2022_05" 221110 |
| 2023 May | 2023_02 | results | selection of canonicals for UP2023_02 (public 2023-May-03) based on "mam2023_01" 230302 |
| 2023 Jun | 2023_03 | results | selection of canonicals for UP2023_03 (public 2023-Jun-28) based on "mam2023_02" 230327 |
| 2023 Sep | 2023_04 | results | selection of canonicals for UP2023_04 (public 2023-Sep-13) based on "mam2023_03dog" 230621 |
| 2023 Nov | 2023_05 | results | selection of canonicals for UP2023_05 (public 2023-Nov-08) based on "mam2023_04dog_rerun" 230826 |
| 2024 Feb | 2024_01 | results | selection of canonicals for UP2024_01 (public 2024-Jan-24) based on "pmam2023_05om4om18" 231106 |
This directory contains a set of suggested changes to current canonical assignments for 8 mammalian taxa (BOVIN, CANLF, GORGO, HUMAN, MONDO, MOUSE, PANTR, RAT). The suggestions were identified by:
Because many of these clades had mixtures of canonical and isoform sequences, we used a weighting function that sought clades with more SwissProt entries and more canonicals, weighting contributions from HUMAN > MOUSE > RAT > other taxa.
Here, we are showing suggested changes from the mam (BOVIN, CANLF, GORGO, HUMAN, MONDO, MOUSE, PANTR, RAT) analysis, which occur in 5711 Panther orthogroups. We do not show the 11090 Panther orthogroups where the Ortho2tree analysis is consistent with the current canonical assignments. More than 90% of those orthogroups also agree with the MANE assignment. Thus, there is a large amount of MANE agreement that is not shown in these files.
| Dataset | canonical | proposed | Raw data |
|---|---|---|---|
| qfomam_mane_sp_sp_240206.tab | SwissProt | SwissProt | link |
| qfomam_mane_sp_tr_240206.tab | SwissProt | TrEMBL | link |
| qfomam_mane_tr_tr_240206.tab | TrEMBL | TrEMBL | link |
| qfomam_mane_all_240206.tab | All changes | All changes | link |
| Field | Description |
|---|---|
| pthr_id | Panther orthogroup |
| taxon | HUMAN/MOUSE/etc |
| canon_acc | canonical acc |
| canon_len | canonical sequence length |
| prop_acc | proposed canonical (an isoform for this dataset) |
| prop_len | proposed sequence length |
| rank_score | score weighted by prop_cost, n_sp, w_canon, etc. used to rank proposed changes |
| canon_cost | cost of clade with the specified taxa with only canonical sequences |
| prop_cost | cost of clade if proposed canonicals replace canonicals |
| n_sp | number of SP sequences |
| n_tax | number of taxa (5 max for this data) |
| n_canon | number of canonical sequences in clade |
| wn_canon | weighted number of canonical sequences in clade - human and mouse count 2X, rat 1X, other canonicals are not counted |
| clade | "Inner17" — clade label in gap-distance tree |
| clade_members | other members of the proposed clade |
| MANEstatus | MANEstatus: MANE_good, MANE_bad, NAM (orthogroup clade has no HUMAN sequence, so no MANE assignment). |
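
As an illustration of how these fields might be read, here is a minimal Python sketch. It assumes the `.tab` files are tab-delimited with a header row that uses the field names above; that layout, the helper name `read_changes`, and the bare accession format in the sample are assumptions. The two sample rows reuse the orthogroups, accessions, and lengths from the examples in this document and include only a subset of the columns.

```python
import csv
import io

# Assumption: numeric columns and their types, inferred from the field table.
NUMERIC_FIELDS = {
    "canon_len": int, "prop_len": int, "n_sp": int, "n_tax": int,
    "n_canon": int, "wn_canon": float, "rank_score": float,
    "canon_cost": float, "prop_cost": float,
}

def read_changes(handle):
    """Read suggested changes from a tab-delimited file with a header row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for name, cast in NUMERIC_FIELDS.items():
            # Convert only the numeric columns that are present and non-empty.
            if row.get(name) not in (None, ""):
                row[name] = cast(row[name])
        rows.append(row)
    return rows

# Illustrative sample with a subset of columns (not a real file excerpt).
HEADER = ["pthr_id", "taxon", "canon_acc", "canon_len",
          "prop_acc", "prop_len", "MANEstatus"]
SAMPLE_ROWS = [
    ["PTHR45840:SF4", "HUMAN", "O75783", "438", "O75783-2", "373", "MANE_good"],
    ["PTHR20859:SF51", "HUMAN", "Q969J5", "263", "Q969J5-2", "231", "MANE_bad"],
]
SAMPLE = "\n".join("\t".join(f) for f in [HEADER] + SAMPLE_ROWS) + "\n"

changes = read_changes(io.StringIO(SAMPLE))
for row in changes:
    # Report how much shorter the proposed canonical is than the current one.
    print(row["pthr_id"], row["taxon"], row["canon_len"] - row["prop_len"])
```

For a real file, `read_changes(open(path, newline=""))` would be the equivalent call, provided the header assumption holds.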
Suggestions are ordered from highest rank score to lowest. The rank score weights the number of Swiss-Prot canonicals in the clade (higher is better), the weighted number of canonicals (higher is better), the proposed clade score (lower is better), and the fraction of proposed isoforms in the clade (lower is better). All of the proposed costs are < 0.02, and many are 0.00, indicating that all the sequences in the clade align without gaps.
The evolutionary trees associated with each orthogroup are included in the pdf_data/ directory. The leaves of the trees are colored to indicate whether they are part of the clade that has been selected, are consistent or inconsistent with the MANE suggestion, and whether the canonical should be changed.
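
The weighted canonical count and the ordering can be sketched in Python as follows. The weights come from the `wn_canon` field description (HUMAN and MOUSE count 2X, RAT 1X, other canonicals are not counted). The function names are hypothetical, and the full rank score formula is not reproduced because it is not specified here.

```python
# Weights taken from the wn_canon field description.
CANON_WEIGHTS = {"HUMAN": 2, "MOUSE": 2, "RAT": 1}

def weighted_canon_count(canonical_taxa):
    """Sum per-taxon weights over the canonical sequences in a clade.

    canonical_taxa: taxon codes (e.g. "HUMAN") of the clade members that
    are current canonicals. Taxa without a weight contribute zero.
    """
    return sum(CANON_WEIGHTS.get(taxon, 0) for taxon in canonical_taxa)

def order_suggestions(rows):
    """Return suggestions sorted from highest rank_score to lowest."""
    return sorted(rows, key=lambda row: row["rank_score"], reverse=True)
```

For a clade whose canonicals come from HUMAN, MOUSE, RAT, BOVIN, and PANTR, `weighted_canon_count` gives 2 + 2 + 1 + 0 + 0 = 5.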
This example (pdf) shows a case where both MANE and Ortho2tree suggest a change.
| PTHR45840:SF4 | 0.8810 | 0.0167 | 5 | 3 | MANE_good | |
| HUMAN | sp|O75783 | (438) | → | sp|O75783-2(373) |

Here, the title text shows the Panther Orthogroup, the score used to rank the suggested change, the tree-cost of the canonical sequences, and the tree-cost for the proposed mixture of canonical and isoform sequences. Also shown is the gap-distance based tree, and the multiple alignment that produced the gap distances. The multiple alignment shows aligned residues (blue), which may not be identical, and gapped regions (grey).
In this example, the two current canonicals selected for change are highlighted in red. The two current isoforms proposed as canonicals are shown in green (RAT) and light blue (HUMAN); the HUMAN isoform is light blue because that suggestion is also made by MANE. (When MANE and Ortho2tree agree that a canonical should stay the same, the taxon is highlighted in dark blue.)
This example (pdf) shows a case where the MANE suggestion and Ortho2tree disagree.
| PTHR20859:SF51 | 0.1332 | 0.0130 | 5 | 4 | MANE_bad | ||
| HUMAN | sp|Q969J5 | (263) | → | sp|Q969J5-2 | (231) |

| Color | Meaning |
|---|---|
| blue | current canonical included in clade |
| dark-blue | current canonical in clade that agrees with MANE (MANE_good) |
| green | proposed canonical in clade |
| light blue | proposed canonical in clade supported by MANE (MANE_good) |
| black | current canonical not included in the clade used to suggest changes |
| red | current canonical that is proposed to be changed (there will be a matching green leaf) |
| orange | current canonical that is proposed to be changed but is selected by MANE (MANE_bad) |
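
The color legend can be summarized as a small decision function. This Python sketch is only a restatement of the table; the `role` labels and the function name `leaf_color` are hypothetical, and grey for other isoforms is taken from the figure caption earlier in this document.

```python
def leaf_color(role: str, in_clade: bool = False, mane: bool = False) -> str:
    """Map a tree leaf to its legend color.

    role: hypothetical label, one of
        "canonical" - current canonical that is kept
        "proposed"  - isoform proposed as the new canonical
        "replaced"  - current canonical proposed to be changed
        "isoform"   - any other isoform
    in_clade: whether the leaf belongs to the clade used for suggestions
    mane: whether MANE selects this sequence
    """
    if role == "proposed":
        return "light blue" if mane else "green"
    if role == "replaced":
        # Orange marks disagreement with MANE (MANE_bad).
        return "orange" if mane else "red"
    if role == "canonical":
        if not in_clade:
            return "black"
        return "dark-blue" if mane else "blue"
    return "grey"
```

Every red or orange leaf should have a matching green or light blue leaf from the same taxon, since a replaced canonical always comes with a proposed one.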
Taxonomic distribution of proposed changes (for the 6-Feb-2024 data):
| taxon | SP to SP | SP to TR | TR to TR |
|---|---|---|---|
| BOVIN | 5 | 31 | 892 |
| CANLF | 0 | 2 | 19 |
| GORGO | 0 | 0 | 1708 |
| HUMAN | 429 | 234 | 0 |
| MONDO | 0 | 0 | 842 |
| MOUSE | 95 | 211 | 54 |
| PANTR | 2 | 4 | 2194 |
| RAT | 38 | 174 | 870 |
Thus, almost all of the suggested changes to SwissProt canonicals are in HUMAN and MOUSE, while the TrEMBL-to-TrEMBL changes are concentrated in the less well characterized proteomes (PANTR, GORGO, BOVIN, RAT, and MONDO).
Of the suggested changes to HUMAN canonicals, 181 SwissProt-SwissProt (sp_sp) changes are supported by MANE, and 243 are not. For the SwissProt-TrEMBL changes (sp_tr), 394 are supported by MANE, and 229 are not. However, this analysis only shows suggested changes. In orthogroup clades that confirm the current canonicals, ortho2tree agrees with MANE more than 90% of the time.
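
The per-taxon shares behind this summary can be checked directly from the table. The following Python sketch copies the counts from the taxonomic distribution table above; the helper name `share` is hypothetical.

```python
# Counts copied from the taxonomic distribution table (6-Feb-2024 data):
# (SP to SP, SP to TR, TR to TR)
CHANGES = {
    "BOVIN": (5, 31, 892),
    "CANLF": (0, 2, 19),
    "GORGO": (0, 0, 1708),
    "HUMAN": (429, 234, 0),
    "MONDO": (0, 0, 842),
    "MOUSE": (95, 211, 54),
    "PANTR": (2, 4, 2194),
    "RAT": (38, 174, 870),
}
SP_SP, SP_TR, TR_TR = 0, 1, 2

def share(taxa, column):
    """Fraction of all changes in one column that fall in the given taxa."""
    total = sum(counts[column] for counts in CHANGES.values())
    return sum(CHANGES[taxon][column] for taxon in taxa) / total

print(f"HUMAN+MOUSE share of SP to SP: {share(['HUMAN', 'MOUSE'], SP_SP):.1%}")
print(f"PANTR+GORGO share of TR to TR: {share(['PANTR', 'GORGO'], TR_TR):.1%}")
```

HUMAN and MOUSE together account for 524 of the 569 SwissProt-to-SwissProt changes.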