Reference Proteomes Canonical changes suggested 6-Feb-2024 (rev)

[Uniprot 2023_01 (Ensembl 107)]


SwissProt to SwissProt changes (566 suggestions)
SwissProt to TrEMBL changes (555 suggestions)
TrEMBL to TrEMBL changes (5377 suggestions)

Ortho2tree — a better way to select canonical isoforms

The Ortho2tree analysis pipeline seeks to provide consistent sets of canonical protein sequences for organisms that have diverged over the last 100 – 250 million years. In many eukaryotes, an individual gene can produce multiple proteins from different transcript isoforms, that is, different selections of protein-coding exons. The UniProt protein databases select one of those isoforms for inclusion in a reference proteome set; other isoforms are available, but the default, or "canonical", isoform is the one that is most heavily searched and annotated. Historically, the canonical isoform for unreviewed protein sequences (sequences in the TrEMBL sequence set) was selected using a simple rule: the longest isoform was chosen as canonical. This led to some biologically implausible results; for example, mouse and rat sequences for the same enzyme that differ dramatically in length.

For example, the mouse, human, chimp, and cow Eukaryotic translation initiation factor 3 subunit L proteins are 564 amino acids long, and all share more than 99% identity from beginning to end. In contrast, in UniProt release UP2022_05, the rat canonical translation initiation factor (A0A8I5ZLD0_RAT) was 621 residues long. While it also shared 99.6% identity with the mouse factor, the alignment was incomplete, covering only residues 13-564 of the mouse sequence and 70-621 of the rat. An isoform of A0A8I5ZLD0_RAT, G3V7G9_RAT, is exactly the same length as the other mammalian proteins and aligns from beginning to end (99.5% identity). Thus, G3V7G9_RAT became the canonical isoform with release UP2023_01, based on the ortho2tree analysis:



Evolutionary tree and multiple alignment used to select an alternative canonical isoform. The evolutionary tree on the left was constructed using sequence distances counting only gaps; substitutions do not contribute to the calculated distance. The multiple sequence alignment on the right shows the location of sequence (blue) and gaps (grey). In this example, the clade containing the canonical human, chimp (PANTR), cow (BOVIN), mouse, and rat sequences has a cost of 0.0913, while the clade with the proposed canonical G3V7G9_RAT has a cost of zero. The rat canonical in UP2022_05 is highlighted in red. The proposed rat canonical G3V7G9 is highlighted in green. Mouse, cow, and chimp canonicals that belong to the low-cost clade are highlighted in blue. The human canonical is highlighted in dark blue because it is also the MANE selection. Sequence names in black are canonicals that were not part of the clade used to propose a change; sequence names in grey are other isoforms. Numbers in brackets after the sequence name indicate the length of the sequence, and the clade (≥1) that was used to suggest a change.


Ortho2tree release timeline

Date      UP release  Results  Notes
2022 Sep                       ortho2tree poster at qfo workshop in Spain (poster)
2023 Feb  2023_01     results  First integration into UP genecentric to select canonicals of UP2023_01 (public 2023-Feb-22) based on analysis "qfomam2022_05" 221110
2023 May  2023_02     results  selection of canonicals for UP2023_02 (public 2023-May-03) based on "mam2023_01" 230302
2023 Jun  2023_03     results  selection of canonicals for UP2023_03 (public 2023-Jun-28) based on "mam2023_02" 230327
2023 Sep  2023_04     results  selection of canonicals for UP2023_04 (public 2023-Sep-13) based on "mam2023_03dog" 230621
2023 Nov  2023_05     results  selection of canonicals for UP2023_05 (public 2023-Nov-08) based on "mam2023_04dog_rerun" 230826
2024 Feb  2024_01     results  selection of canonicals for UP2024_01 (public 2024-Jan-24) based on "pmam2023_05om4om18" 231106
Note: Ortho2tree canonical suggestions integrated in UniProt release "n" (e.g. UP2023_02) are based on UniProt data from release "n-1" (UP2023_01).

Data description

This directory contains a set of suggested changes to the current canonical assignments for 8 mammalian taxa (BOVIN, CANLF, GORGO, HUMAN, MONDO, MOUSE, PANTR, RAT). The suggestions were identified by:

  1. starting with about 20,000 Panther orthogroups
  2. identifying the gene-centric clusters/isoforms for each of the canonicals in the Panther set
  3. removing Panther orthogroups with sequences from fewer than 3 taxa (leaving about 17,000)
  4. building a multiple sequence alignment of all the canonical and isoform sequences from all the taxa in the orthogroup
  5. constructing a gap-distance tree from the multiple alignment (only gapped residues contribute to the distance; thus 4 orthologs that align without gaps would have a distance of zero)
  6. identifying clades in the tree with low cost and a diverse set of taxa
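
The gap-only distance in step 5 can be illustrated with a minimal sketch. The normalization used here (gapped columns divided by columns where at least one sequence has a residue) is an assumption made for illustration, not a description of the Ortho2tree implementation, and the toy sequences are made up.

```python
from itertools import combinations

GAP = "-"


def gap_distance(a: str, b: str) -> float:
    """Fraction of alignment columns in which exactly one sequence has a gap.

    Substitutions are ignored, so two rows that align without gaps
    have distance 0.0 even if their residues differ.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns where both rows are gaps; they say nothing about this pair.
    cols = [(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP)]
    if not cols:
        return 0.0
    gapped = sum((x == GAP) != (y == GAP) for x, y in cols)
    return gapped / len(cols)


def distance_matrix(aligned):
    """Pairwise gap distances for a dict of name -> aligned sequence."""
    return {
        (p, q): gap_distance(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    }


# Toy alignment (hypothetical sequences, not real proteins).
toy = {
    "HUMAN": "MKTAYIAKQR",
    "MOUSE": "MKSAYIAKQR",  # one substitution, no gaps
    "RAT": "MK--YIAKQR",    # two gapped columns
}
distances = distance_matrix(toy)
```

In this toy case the HUMAN/MOUSE pair has distance zero despite the substitution, while either sequence paired with RAT has a non-zero distance because of the two gapped columns.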

Many of these clades had mixtures of canonical and isoform sequences; we used a weighting function that sought clades with more SwissProt entries and more canonicals, weighting contributions from HUMAN > MOUSE > RAT > other taxa.

Here, we are showing suggested changes from the mam (BOVIN, CANLF, GORGO, HUMAN, MONDO, MOUSE, PANTR, RAT) analysis, which occur in 5711 Panther orthogroups. We do not show the 11090 Panther orthogroups where the Ortho2tree analysis is consistent with the current canonical assignments. More than 90% of those orthogroups also agree with the MANE assignment. Thus, there is a large amount of MANE agreement that is not shown in these files.


For each dataset, four files are provided:
Dataset                       canonical  proposed   Raw data
qfomam_mane_sp_sp_240206.tab  SwissProt  SwissProt  link
qfomam_mane_sp_tr_240206.tab  SwissProt  TrEMBL     link
qfomam_mane_tr_tr_240206.tab  TrEMBL     TrEMBL     link
qfomam_mane_all_240206.tab    All changes           link

Raw data files have 16 fields:
pthr_id        Panther orthogroup
taxon          HUMAN/MOUSE/etc
canon_acc      canonical acc
canon_len      canonical sequence length
prop_acc       proposed canonical (an isoform for this dataset)
prop_len       proposed sequence length
rank_score     score weighted by prop_cost, n_sp, w_canon, etc. used to rank proposed changes
canon_cost     cost of clade with the specified taxa with only canonical sequences
prop_cost      cost of clade if proposed canonicals replace canonicals
n_sp           number of SP sequences
n_tax          number of taxa (5 max for this data)
n_canon        number of canonical sequences in clade
wn_canon       weighted number of canonical sequences in clade - human and mouse count 2X, rat 1X, other canonicals are not counted
clade          clade label in gap-distance tree, e.g. "Inner17"
clade_members  other members of the proposed clade
MANEstatus     MANE status: MANE_good, MANE_bad, NAM (orthogroup clade has no HUMAN sequence, so no MANE assignment)
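
The field list above can be turned into a small reader. This is a sketch under several assumptions: the .tab files are tab-delimited, the columns appear in the order listed, and a header or comment line may or may not be present. The sample record is made up and does not come from the real files.

```python
import csv
import io

FIELDS = [
    "pthr_id", "taxon", "canon_acc", "canon_len", "prop_acc", "prop_len",
    "rank_score", "canon_cost", "prop_cost", "n_sp", "n_tax", "n_canon",
    "wn_canon", "clade", "clade_members", "MANEstatus",
]
INT_FIELDS = ("canon_len", "prop_len", "n_sp", "n_tax", "n_canon")
FLOAT_FIELDS = ("rank_score", "canon_cost", "prop_cost", "wn_canon")


def read_suggestions(handle):
    """Parse tab-delimited rows into dicts, converting numeric fields."""
    rows = []
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        # Skip comment lines and a header line, if the file has one (assumption).
        if row["pthr_id"].startswith("#") or row["pthr_id"] == "pthr_id":
            continue
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Hypothetical two-line file: a header plus one made-up record (not real data).
SAMPLE = "\t".join(FIELDS) + "\n" + "\t".join([
    "PTHR00000:SF0", "HUMAN", "sp|X00000", "438", "sp|X00000-2", "373",
    "0.5", "0.05", "0.0", "3", "5", "4", "5", "Inner17", "MOUSE,RAT",
    "MANE_good",
]) + "\n"

records = read_suggestions(io.StringIO(SAMPLE))
```

With a downloaded file, the same function could be called on a handle opened with `open(path, newline="")`.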

Suggestions are ordered from highest rank score to lowest. The rank score weights the number of Swiss-Prot canonicals in the clade (higher is better), the weighted number of canonicals (higher is better), the proposed clade score (lower is better), and the fraction of proposed isoforms in the clade (lower is better). All of the proposed costs are < 0.02, and many are 0.00, indicating that all the sequences in the clade align without gaps.
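
As a rough illustration of two of the quantities mentioned here, the sketch below computes the weighted canonical count using the weights given for the wn_canon field (HUMAN and MOUSE 2X, RAT 1X, others not counted) and orders records by rank score. The function names, the cost filter, and the sample rows are hypothetical; the actual rank-score formula is not reproduced.

```python
# Weights taken from the wn_canon field description; other taxa count zero.
CANON_WEIGHT = {"HUMAN": 2, "MOUSE": 2, "RAT": 1}


def weighted_canon_count(canonical_taxa):
    """Weighted number of canonical sequences in a clade."""
    return sum(CANON_WEIGHT.get(taxon, 0) for taxon in canonical_taxa)


def order_suggestions(rows, max_prop_cost=0.02):
    """Keep rows below the cost threshold, highest rank_score first."""
    kept = [r for r in rows if r["prop_cost"] < max_prop_cost]
    return sorted(kept, key=lambda r: r["rank_score"], reverse=True)


# Made-up rows for illustration only.
rows = [
    {"pthr_id": "PTHR_A", "rank_score": 0.40, "prop_cost": 0.000},
    {"pthr_id": "PTHR_B", "rank_score": 0.90, "prop_cost": 0.015},
    {"pthr_id": "PTHR_C", "rank_score": 0.95, "prop_cost": 0.050},
]
ranked = order_suggestions(rows)
wn = weighted_canon_count(["HUMAN", "MOUSE", "RAT", "BOVIN"])
```

Here PTHR_C is dropped by the illustrative cost threshold even though it has the highest rank score, and the four-taxon clade has a weighted count of 5.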

The evolutionary trees associated with the orthogroup are included in the pdf_data/ directory. The leaves of the trees are colored to indicate whether they are part of the clade that has been selected, are consistent or inconsistent with the MANE suggestion, and whether the canonical should be changed.

This example (pdf) shows a case where both MANE and Ortho2tree suggest a change.

PTHR45840:SF40.88100.016753  MANE_good
HUMAN  sp|O75783(438)  sp|O75783-2(373)



Here, the title text shows the Panther orthogroup, the score used to rank the suggested change, the tree-cost of the canonical sequences, and the tree-cost for the proposed mixture of canonical and isoform sequences. Also shown are the gap-distance based tree and the multiple alignment that produced the gap distances. The multiple alignment shows aligned residues (blue), which may not be identical, and gapped regions (grey).

In this example, the two current canonicals selected for change are highlighted in red, and the two current isoforms proposed as canonical are shown in green (RAT) and light blue (HUMAN); the HUMAN isoform is light blue because that suggestion is also made by MANE. (When MANE and Ortho2tree agree that a canonical should stay the same, the taxon is highlighted in dark blue.)

This example (pdf) shows a case where the MANE suggestion and Ortho2tree disagree.

PTHR20859:SF510.13320.013054  MANE_bad
HUMAN  sp|Q969J5(263)  sp|Q969J5-2(231)



In this case, the current canonical is selected by MANE (and highlighted in orange), while the Ortho2tree suggestion is shown in green. The current canonical, which is also the MANE-selected isoform, has an additional exon that is missing from the MOUSE, RAT, CANLF, and BOVIN sequences; the proposed HUMAN isoform also lacks that exon.


Taxon names are colored as follows:
blue        current canonical included in clade
dark-blue   current canonical in clade that agrees with MANE (MANE_good)
green       proposed canonical in clade
light blue  proposed canonical in clade supported by MANE (MANE_good)
black       current canonical not included in the clade used to suggest changes
red         current canonical that is proposed to be changed (there will be a matching green leaf)
orange      current canonical proposed to be changed but is selected by MANE (MANE_bad)
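
The color scheme can be restated as a small decision function. This is one reading of the table, not code from the pipeline: the function name and its boolean arguments are hypothetical, and grey for non-canonical isoforms is taken from the figure caption earlier in this page.

```python
def leaf_color(is_canonical, in_clade=False, proposed=False,
               to_change=False, mane=False):
    """Return the leaf color for one sequence in the tree.

    is_canonical: the sequence is a current canonical
    in_clade:     it belongs to the clade used to suggest changes
    proposed:     it is an isoform proposed as the new canonical
    to_change:    it is a current canonical proposed to be replaced
    mane:         it is the MANE selection (HUMAN only)
    """
    if proposed:
        return "light blue" if mane else "green"
    if not is_canonical:
        return "grey"  # other isoforms, per the figure caption
    if to_change:
        return "orange" if mane else "red"
    if in_clade:
        return "dark-blue" if mane else "blue"
    return "black"


# Example: a HUMAN isoform proposed as canonical and also selected by MANE.
example = leaf_color(is_canonical=False, in_clade=True, proposed=True, mane=True)
```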

Taxonomic distribution of proposed changes (for the 6-Feb-2024 data):

taxon  SP to SP  SP to TR  TR to TR
BOVIN         5        31       892
CANLF         0         2        19
GORGO         0         0      1708
HUMAN       429       234         0
MONDO         0         0       842
MOUSE        95       211        54
PANTR         2         4      2194
RAT          38       174       870

Thus, almost all of the SwissProt suggested changes are in HUMAN and MOUSE, while the TrEMBL changes are to RAT and BOVIN, and other less well characterized genomes.
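
The column totals and the HUMAN plus MOUSE share of SwissProt to SwissProt changes follow directly from the table above; the short sketch below does that arithmetic. The variable names are illustrative only.

```python
# Counts copied from the taxonomic distribution table:
# (SP to SP, SP to TR, TR to TR) per taxon.
COUNTS = {
    "BOVIN": (5, 31, 892),
    "CANLF": (0, 2, 19),
    "GORGO": (0, 0, 1708),
    "HUMAN": (429, 234, 0),
    "MONDO": (0, 0, 842),
    "MOUSE": (95, 211, 54),
    "PANTR": (2, 4, 2194),
    "RAT": (38, 174, 870),
}

# Sum each column across taxa.
totals = tuple(sum(column) for column in zip(*COUNTS.values()))
# totals == (569, 656, 6579)

# Fraction of SP to SP changes that fall in HUMAN or MOUSE.
sp_sp_human_mouse = (COUNTS["HUMAN"][0] + COUNTS["MOUSE"][0]) / totals[0]
```

By these counts, HUMAN and MOUSE account for roughly 92% of the SwissProt to SwissProt suggestions.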

Of the suggested changes to HUMAN canonicals, 181 SwissProt-SwissProt (sp_sp) changes are supported by MANE and 243 are not. For the SwissProt-TrEMBL changes (sp_tr), 394 are supported by MANE and 229 are not. However, this analysis only shows suggested changes. In orthogroup clades that confirm the current canonicals, ortho2tree agrees with MANE more than 90% of the time.


SwissProt to SwissProt changes (569 suggestions)