Index of /wrpearson/fasta/data/gonzalez09a/nar2010

Name	Last modified	Size

Parent Directory		-
trees.tgz	2009-12-30 14:49	84M
FAQ.txt	2010-01-03 15:46	1.8K
family_members.annot.gz	2010-03-03 10:25	5.2M
queries.tgz	2010-03-29 13:54	787K
pfam_to_clan.txt	2010-03-29 13:55	9.3K
library_all_domains...>	2010-03-29 13:56	56M
library_all_domains_..>	2010-03-29 13:56	63M
library_long_domains..>	2010-03-29 13:56	35M
library_long_domains..>	2010-03-29 13:56	38M
CONTENTS	2010-06-22 13:49	12K
release_date.txt	2010-06-22 13:49	35

29-December-2009

                         RefProtDom
						 
The files in this directory contain a set of highly divergent families
that were originally selected from Pfam (v.21) and then manually
curated to supplement the homologies identified in Pfam (M. W. Gonzalez
and W. R. Pearson. RefProtDom: A Protein Database with Improved Domain 
Boundaries and Homology Relationships. Manuscript in Preparation).

Six types of files are provided: 
1.  Reference library sequence files
2.  Annotation files (list homologous domains in the reference libraries) 
3.  Supplementary Annotation Files
4.  A tar-gzip file with sets of query sequences
5.  A tar-gzip file with the trees and multiple sequence alignments for the superfamilies
6.  A file of the most frequently-asked questions (FAQ.txt)

================ 1. Reference libraries ================

library_all_domains.fa.gz - full-length Uniprot proteins 
  containing homologs to the query domains.

library_all_domains_rdm.fa.gz - Random-shuffles of each of 
  the full-length Uniprot proteins in library_all_domains.fa.gz.

library_long_domains.fa.gz - A subset of the library_all_domains.fa.gz
  library from which proteins with homologous domains less than 75% of
  the Pfam model length are excluded.

library_long_domains_rdm.fa.gz -  Random-shuffles of each of 
  the full-length Uniprot proteins in library_long_domains.fa.gz

================ 2. Annotation files ==================

family_members.annot.gz - lists the domains in each sequence in the
  library_*_domains.fa.gz files.

Format:
>[source]|[accession]|[sequence_name]
[superfamily] [domain_start] [domain_end] [e-value] [mode] [long_domain] [non_redundant]

Examples:
>up|Q9WY68|1A1D_THEMA
PF00291 3       300     1.9e-08 ls      0       1
PF00291 3       33      0.0047  fs      0       0
PF00291 70      300     0.014   fs      0       0
>pfam21|P70122|SBDS_MOUSE
PF01172 3       243     2e-107  ls      1       0

source			"up" if the sequence matches the current (12/2009) Uniprot version of the sequence
				"pfam21" if the exact sequence is no longer in Uniprot; the Pfam v.21 sequence is used instead
			  
superfamily 		The pfam accession name (PF##### when the family was the
				sole representative of its superfamily/clan) or clan
				number (CL[clan_id] when the superfamily has several
				families that have been coalesced into one homologous
				group)
			  
domain_start  	Sequence coordinate where domain starts

domain_end    	Sequence coordinate where domain ends

e-value 			The score of the comparison between the sequence fragment from
				domain_start to domain_end against the HMM model of the given
				family or the e-value generated by the supplemental annotation
				methods described in (Gonzalez, M.W. and W.R. Pearson, 2010b)

mode				The type of pfam HMM model used to identify the given domain or the 
				supplemental annotation method described in (Gonzalez, M.W. and W.R. Pearson, 2010b)
				"pf21ls" mode domains match the entire footprint of the pfam domain model
				"pf21fs" domains are usually fragments that only partially match the pfam domain model
				"ext" are domains that were previously annotated as partial homologies whose coordinates we extended
				"ua" mode refers to previously missed homologs found by Gonzalez and Pearson

long_domain 		"0" if the sequence contains domains whose lengths are
	     			 <75% of the Pfam model length. long_domain=0 sequences
					 are only found in library_all_domains.fa.gz
				"1" if the sequence only contains domains whose lengths
					 are >=75% of the Pfam model length. long_domain=1
					 sequences are found in library_long_domains.fa.gz and in
					 library_all_domains.fa.gz

non_redundant  Useful to calculate family size
			   "0" flags a redundant domain that overlaps with another with
				   	longer sequence homology annotation
			   "1" flags the non-redundant domain with the longer sequence
			   		homology annotation
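
A minimal Python sketch of reading this format is given here as an
illustration only; it is not part of the RefProtDom distribution. It
assumes that the data columns are separated by whitespace, as in the
Examples above, and that every sequence id has exactly three
"|"-separated parts. The names parse_annot and family_sizes are
hypothetical. Family size is counted as the number of non_redundant=1
domains per superfamily, which is one reading of "useful to calculate
family size".

```python
import io
from collections import Counter

# Sample records copied from the Examples above.
SAMPLE = """\
>up|Q9WY68|1A1D_THEMA
PF00291 3       300     1.9e-08 ls      0       1
PF00291 3       33      0.0047  fs      0       0
PF00291 70      300     0.014   fs      0       0
>pfam21|P70122|SBDS_MOUSE
PF01172 3       243     2e-107  ls      1       0
"""


def parse_annot(handle):
    """Yield (source, accession, sequence_name, domains) for each sequence."""
    header, domains = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield (*header, domains)
            header, domains = tuple(line[1:].split("|", 2)), []
        else:
            # Assumption: seven whitespace-separated columns per domain line.
            fam, start, end, evalue, mode, long_dom, non_red = line.split()
            domains.append({
                "superfamily": fam,
                "start": int(start),
                "end": int(end),
                "evalue": float(evalue),
                "mode": mode,
                "long_domain": long_dom == "1",
                "non_redundant": non_red == "1",
            })
    if header is not None:
        yield (*header, domains)


def family_sizes(records):
    """Count non_redundant=1 domains per superfamily."""
    sizes = Counter()
    for _source, _accession, _name, domains in records:
        for dom in domains:
            if dom["non_redundant"]:
                sizes[dom["superfamily"]] += 1
    return sizes


records = list(parse_annot(io.StringIO(SAMPLE)))
sizes = family_sizes(records)
```

To read the distributed file directly, the same function can be given
the handle returned by gzip.open("family_members.annot.gz", "rt").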

======================= 3. Supplementary Annotation files  ==================================

pfam_to_clan.txt - Lists the pfam family to clan superfamily correspondence.
	Note: The annotations in this database are at the superfamily level, which we recommend for
	homology evaluation. See FAQ.txt and (Gonzalez and Pearson, NAR, 2010) for more details on why
	coalescing superfamilies is the preferred choice when evaluating homology.
	
refprotdom_domain_bound_ext.txt - Lists the domains that were annotated as partial
	homologies in pfam v.21 and whose coordinates we extended. Current Uniprot accessions
	and sequence ids are provided, as well as the corresponding pfam v.24 coordinates.
	Seq_id 				Sequence identifier
	Status  				"available" if sequence is available in current Uniprot and in pfam24
						"dropped_by_p24" if sequence is available in current Uniprot but not in pfam24
						"demerged" if sequence is available in current Uniprot but has been broken into
							several shorter sequences
	Pfam    				Pfam name
	R_start/R_end   		RefProtDom domain boundaries
	SS_start/SS_end  	SSEARCH domain boundaries
	GL_start/GL_end  	GLSEARCH domain boundaries
	v21_start/v21_end 	Pfam v.21 domain boundaries
	v24_start/v24_end 	Pfam v.24 domain boundaries
	Fixed_by_24     		"1" if Pfam v.24  boundaries are within 10% of RefProtDom boundaries
	Fixed_annot_cov		The fraction of domain overlap between pfam v.24 and RefProtDom boundaries

refprotdom_unannot_homol.txt - Lists missed/unannotated homologs in pfam v.21 that we uncovered with 
	reverse PSI-BLAST searches or through SCOP/CATH structural evidence.
	Seq_id 				Sequence identifier
	Evidence        		"rev" if the homology was uncovered using reverse PSI-BLAST searches
						"str" if the homology was uncovered with SCOP/CATH structural evidence
	Pfam    				(Same as in *ext.txt above)
	R_start/R_end   		(Same as in *ext.txt above)
	Status  				(Same as in *ext.txt above)
	v24_start/v24_end	(Same as in *ext.txt above)
	Fixed_by_v24    		(Same as in *ext.txt above)
	Fixed_annot_cov		(Same as in *ext.txt above)
					 
=========================== 4. Query sequences =================================			  

queries.tgz is a gzip-ed tar file that produces the following directories:

  queries/by_difficulty/
  queries/by_tree_location/

In "queries/by_difficulty/", there are two classes of query sequence
files, each of which contains 50 domain sequences, in 10 different
random sequence embeddings.

   hard_embedded.[1-10].fa
   sampled_embedded.[1-10].fa

In addition, there are two files of non-embedded queries:

   hard_non_embedded.fa
   sampled_non_embedded.fa

"hard" domains are domains that find the smallest number of related
sequences after a BLASTP search.  "sampled" domains were chosen at
random from 640 domains selected because of their length (>200
residues in the Pfam model) and phylogenetic diversity (homologs in
2 of the 3 kingdoms of life: e.g. homologs in archaea and eukarya, 
or in archaea and bacteria, etc).

"queries/by_tree_location/" also contains two classes of query
sequence files, each with 50 domain sequences, in 10 different random
sequence embeddings.  Here, the classes are "des", for queries from
relatively deserted parts of the domain phylogenetic tree, and "pop",
for queries from a populated region.

QUERY EMBEDDING

All queries are available as bare domains (non-embedded/ne) or flanked
by artificial proteins (embedded/e#). The embeddings were created by
randomly shuffling the domain as described by Gonzalez and
Pearson. For each domain, 10 different embedding replicates are
provided. Unless otherwise specified, the results in (Gonzalez and
Pearson, NAR, 2010) are based on embedding #5.

QUERY FILE NAMES	

A query is a sequence domain from a family that falls under any of the
4 types of queries described above (i.e. hard, sampled, populated/pop,
deserted/des).  The four types of query families are available
as bare domains (non_embedded) or embedded in 10 different shuffles,
using the naming format: [type]_[embedding].[e#].fa. For
example: "hard_embedded.5.fa" contains 50 embedded queries from
hard families and the embedding is the 5th shuffle of the
domain (there are 9 alternate embeddings for the same domain).
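
As an illustration, the naming format can be expanded into the list of
files expected for one query type with a short Python sketch. The
function name query_file_names is hypothetical. Only the "hard" and
"sampled" file names are listed explicitly in this document; applying
the same pattern to "des" and "pop" is an assumption that should be
checked against the contents of queries.tgz.

```python
def query_file_names(query_type, n_embeddings=10):
    """Return the non-embedded file name followed by the embedded replicates."""
    names = [f"{query_type}_non_embedded.fa"]
    names += [
        f"{query_type}_embedded.{i}.fa" for i in range(1, n_embeddings + 1)
    ]
    return names


hard_files = query_file_names("hard")
```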

QUERY FILE FORMAT 

Query files are in FASTA format, with the description line providing
information about the location of the domain, and its origin.
Each query file contains 50 queries of the form:

>qPF00589_e5 e_d_start:96 e_d_end:286 from:up|Q1YWW7|Q1YWW7_PHOPR(194-384); pfam:PF00589; model_len:205; all_homol:1445; long_homol:963; descr:Phage integrase family
KTKKSAKQSDL.... [sequence] ....

The format of the description line is:

>[query_accession] e_d_start:# e_d_end:# from:[sequence_id]([domain_start]-[domain_end]); pfam:[pfam_superfamily]; model_len:[#]; all_homol:[#]; long_homol:[#]; descr:[description]

query_accession Accession number for each query in the format:
               q[pfam_superfamily]_[ne|e#].  For example: qPF00589_e5 is
               a domain from PF00589 that has been embedded in the 5th
               shuffle replicate.

e_d_start/     The boundaries of the Pfam domain in the query sequence
e_d_end

from           The original pfamseq_id (Uniprot id) and coordinates
	           (start-end) of the query domain. 

pfam           The pfam accession name (PF#####) or clan number (CL###). No Clan accession 
               is given when a superfamily contains a single Pfam family.

model_len      The length of the pfam domain model

all_homol      Number of homologs that this family has in the "library_all_domains.fa" library 

long_homol     Number of homologs that this family has in "library_long_domains.fa"

descr          Description of the Pfam domain
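
The description line can be split into its fields with a regular
expression. The Python sketch below is an illustration, not a tool
shipped with these files; the names HEADER_RE and parse_query_header
are hypothetical. It assumes that every description line has exactly
the layout of the example above, including for non-embedded queries,
which has not been verified here.

```python
import re

# Assumption: fields appear in this fixed order, as in the example header.
HEADER_RE = re.compile(
    r"^>(?P<query_accession>\S+)\s+"
    r"e_d_start:(?P<e_d_start>\d+)\s+"
    r"e_d_end:(?P<e_d_end>\d+)\s+"
    r"from:(?P<seq_id>[^(]+)\((?P<domain_start>\d+)-(?P<domain_end>\d+)\);\s*"
    r"pfam:(?P<pfam>[^;]+);\s*"
    r"model_len:(?P<model_len>\d+);\s*"
    r"all_homol:(?P<all_homol>\d+);\s*"
    r"long_homol:(?P<long_homol>\d+);\s*"
    r"descr:(?P<descr>.*)$"
)

INT_FIELDS = (
    "e_d_start", "e_d_end", "domain_start", "domain_end",
    "model_len", "all_homol", "long_homol",
)


def parse_query_header(line):
    """Return the description-line fields as a dict; numeric fields become ints."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized query header: {line!r}")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


example = (
    ">qPF00589_e5 e_d_start:96 e_d_end:286 "
    "from:up|Q1YWW7|Q1YWW7_PHOPR(194-384); pfam:PF00589; model_len:205; "
    "all_homol:1445; long_homol:963; descr:Phage integrase family"
)
query = parse_query_header(example)
```

The "pfam" field gives the superfamily needed to look up true
homologs in family_members.annot.gz.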

======================= 5. Trees and MSAs ==================================

trees.tgz is a gzip-ed tar file that produces the following directories:

  trees/all_domains_in_family/
  trees/long_domains_in_family/

"trees/all_domains_in_family/" contains trees of all domain members of each superfamily
"trees/long_domains_in_family/" contains trees of the long-domain members of each superfamily

All *.tree files in the "trees/" folder are newick-formatted, neighbor-joining trees of a set of
members for each superfamily (i.e. trees that feature all domain members in a superfamily or 
trees of only the long-domain members). 

The *.afa files contain the multiple sequence alignments used to generate the trees.

All superfamily trees were generated using Quicktree
(v. 1.1), and the multiple sequence alignments used to generate them were created using the
HMMER (v. 2.3.2) package. 

================================================================
6. Using the files -- evaluating search alignment accuracy

To determine whether the alignments are true positives (TPs) or false
positives (FPs), all you need to know is the library sequence's id 
(e.g. up|Q1YWW7|Q1YWW7_PHOPR) and the pfam superfamily to
which the query belongs (e.g. qPF00589_e5's superfamily is PF00589).

Find the library sequence in the "family_members.annot.gz" file
(alternatively this information may be stored in mySQL tables) and
compare the domain boundaries there to the alignment coordinates of
the similarity searching algorithm you are testing.  For instance,
let's assume you're testing the PF00589 superfamily (using qPF00589_e5:
a query from hard_embedded.5.fa) and suppose your algorithm finds a
putative homolog on the "XERC_BACSU" sequence from residues 10-80.
Looking at the "family_members.annot" file you would classify this
alignment as a false positive (FP) because the alignment maps 100% to 
the unrelated PF02899 domain.  You may decide to use a specific overlap
percentage to classify the alignments. In (M. W. Gonzalez
and W. R. Pearson. Homologous Overextension: A Challenge for Iterative
Similarity Searches. NAR, accepted. 2009), we require 50% alignment overlap
to the homologous region (i.e. at least 50% of the alignment must be 
between 114-291) for the alignment to be counted as a true positive (TP).

>up|P39776|XERC_BACSU
PF02899 8       91      1.6e-26 ls      1       0
PF00589 114     291     1.6e-65 ls      1       1
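
The 50% overlap rule described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
code used in the paper: coordinates are treated as 1-based and
inclusive, the alignment is compared with each annotated domain of
the query's superfamily separately, and the best single overlap is
used. The names overlap_fraction and classify are hypothetical.

```python
def overlap_fraction(aln_start, aln_end, dom_start, dom_end):
    """Fraction of the alignment that lies inside the domain (inclusive coords)."""
    aln_len = aln_end - aln_start + 1
    shared = min(aln_end, dom_end) - max(aln_start, dom_start) + 1
    return max(shared, 0) / aln_len


def classify(aln_start, aln_end, query_family, domains, min_overlap=0.5):
    """Return "TP" if enough of the alignment falls in a homologous domain."""
    best = max(
        (
            overlap_fraction(aln_start, aln_end, start, end)
            for family, start, end in domains
            if family == query_family
        ),
        default=0.0,  # no domain of the query's superfamily in this sequence
    )
    return "TP" if best >= min_overlap else "FP"


# Domains of up|P39776|XERC_BACSU, copied from the record above.
XERC_BACSU = [("PF02899", 8, 91), ("PF00589", 114, 291)]

result_10_80 = classify(10, 80, "PF00589", XERC_BACSU)      # alignment from the text
result_100_200 = classify(100, 200, "PF00589", XERC_BACSU)  # mostly inside 114-291
```

Because the annotations are at the superfamily level, query_family
should be the superfamily identifier (PF##### or CL###) taken from the
query's "pfam" field.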


================================================================

For more information, contact Bill Pearson (pearson@virginia.edu) or
Mileidy Gonzalez (mileidygwgonzalez@gmail.com)