FAQ

How long does the upload take?

Upload times depend on your Internet connection speed and the size of your files. Uploads over domestic broadband can be slow, and the connection can be lost if the upload takes longer than 1 hour. We recommend using a fast connection instead, such as a 100-megabit university connection. If you have difficulty uploading your files, contact us and we will try to find a solution.

What is a 'genome tag', and how are genes named?

A genome tag is a four-character series of letters and/or numbers that forms the beginning of the gene names for your project. The tag is also used as a short name to identify the project itself.

An example of a genome tag is NCAS, which we used for the Naumovozyma castellii genome project. Genes from this genome have names like NCAS0A01230 and NCAS0E02150.

A genome tag can be a combination of letters and/or numbers, but it must be precisely 4 characters long and must start with a letter.

Gene names are built from the genome tag. For example, the gene name NCAS0E02150 consists of these parts:

NCAS is the genome tag for the species Naumovozyma castellii.
0 after the tag is a mandatory character (reserved for possible future use).
E indicates that this gene is on the 5th scaffold (E is the 5th letter of the alphabet). If you choose the option to sort your scaffolds by size, the letter A will be used for the largest scaffold, B for the second-largest, etc, in descending order of size. If you do not choose this option, the letters A, B, C, etc, will refer to the scaffolds in the same order that they were found in your input Scaffolds file. If there are more than 26 scaffolds in your Scaffolds file, YGAP will use two letters to indicate the scaffold, e.g. NCAS0AG01230; the maximum number of scaffolds allowed is 676 (= 26*26).
02150 is a 5-digit number that identifies the gene. Genes are numbered sequentially, in increments of 10, from the beginning of each scaffold. The numbers assigned by YGAP increase in increments of 10 to allow room for future improvements of the annotation; for example if a new gene is discovered between NCAS0E02150 and NCAS0E02160 in the future it could be called NCAS0E02155.
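
The naming scheme above can be checked mechanically. The following Python snippet is a minimal sketch, not part of YGAP: the function name parse_gene_name and the GeneName type are hypothetical, and the pattern assumes upper-case names. Because the text above does not say how two-letter scaffold codes map to scaffold numbers, the sketch only resolves a scaffold index for single-letter codes.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the description above: a 4-character tag starting with
# a letter, a mandatory "0", one or two scaffold letters, and a 5-digit number.
GENE_NAME = re.compile(r"^([A-Z][A-Z0-9]{3})0([A-Z]{1,2})(\d{5})$")


class GeneName(NamedTuple):
    tag: str
    scaffold: str
    scaffold_index: Optional[int]  # resolved only for single-letter scaffolds
    number: int


def parse_gene_name(name: str) -> GeneName:
    """Split a YGAP-style gene name into its parts (hypothetical helper)."""
    match = GENE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a YGAP-style gene name: {name!r}")
    tag, scaffold, digits = match.groups()
    # A = 1, B = 2, ... for single letters; two-letter codes are left unresolved.
    index = ord(scaffold) - ord("A") + 1 if len(scaffold) == 1 else None
    return GeneName(tag, scaffold, index, int(digits))
```

For instance, parse_gene_name("NCAS0E02150") yields the tag NCAS, scaffold letter E (scaffold index 5) and gene number 2150.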

What format is required for the Scaffolds file?

The Scaffolds file should contain nucleotide sequences in FASTA format. Each sequence should correspond to a scaffold or a chromosome and should look similar to this:

>Scaffold_1
ATTTTAGATATATTCCTCAGAGATCTAGCTCGACGAGAGACTAGGACCACAGATGCAAGGGCTATTTATA
TCGACCAGCTACAACCGCATATATAGCTAGCTACCCAGAGACTTATAGATATTGGTTGTGACTCACGGAG
...
...
TATAGCATCTCCGCAGAGACTTATAGATATTGTTCGAGGAATT
>Scaffold_2
TGCAATTTAGTAATTCCTTTAGAAGCTCGAACCGAGAGACGATGAAATAGCTACTATCTATATACCCACG
TCACACGCTACAACCGCAGGGTACCCAGAGCAGGACTATCAGTGGTTGGCTGATGAACTTCAGAGACGCA
...
...
GCTCCTTAAATATCGTAGAGGCGTCTCTGAATATCTATAACAA
>Scaffold_702
.
.
.

Scaffolds should be long sequences corresponding approximately to complete chromosomes. Each sequence in the Scaffolds file should correspond to several contigs separated by runs of NNNNNN nucleotides.
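
To see how many contigs a scaffold contains, you can split its sequence at the runs of N. This Python snippet is a small illustrative sketch (the function name split_contigs is hypothetical and not part of YGAP); it assumes that any run of one or more N or n characters marks a gap between contigs.

```python
import re


def split_contigs(scaffold_seq: str) -> list[str]:
    """Split a scaffold sequence into contigs at runs of N (either case)."""
    # Empty pieces arise when the scaffold starts or ends with Ns; drop them.
    return [part for part in re.split(r"[Nn]+", scaffold_seq) if part]
```

The number of contigs in a scaffold is then len(split_contigs(sequence)).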

ATTENTION: The total number of scaffolds should be less than or equal to 702.

ATTENTION: Only A,T,C,G,N nucleotide symbols can be processed by YGAP (upper or lowercase is fine). If your fasta file contains ambiguous nucleotide symbols (i.e. R,Y,S,W,K,M,B,D,H,V) they should be replaced with Ns before uploading to YGAP.

File names should not contain any spaces (the run won't complete properly if they do).
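
The requirements above can be checked before uploading. The Python snippet below is a sketch under stated assumptions, not a YGAP tool: the names check_file_name and clean_fasta are hypothetical, the input is taken to be the lines of a FASTA file, and any symbol other than A, C, G, T, N or the ten ambiguity codes listed above is treated as an error.

```python
import re

MAX_SCAFFOLDS = 702  # limit stated above
AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")
ALLOWED = re.compile(r"^[ACGTNacgtn]*$")


def check_file_name(name: str) -> None:
    """Raise if the file name contains a space."""
    if " " in name:
        raise ValueError(f"file name contains spaces: {name!r}")


def clean_fasta(lines):
    """Return FASTA lines with ambiguity codes replaced by N.

    Raises ValueError for unsupported symbols or too many scaffolds.
    """
    cleaned = []
    n_scaffolds = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: one per scaffold, kept unchanged.
            n_scaffolds += 1
            cleaned.append(line)
            continue
        line = AMBIGUOUS.sub("N", line)
        if not ALLOWED.match(line):
            raise ValueError(f"unsupported symbol in sequence line: {line!r}")
        cleaned.append(line)
    if n_scaffolds > MAX_SCAFFOLDS:
        raise ValueError(f"{n_scaffolds} scaffolds exceeds {MAX_SCAFFOLDS}")
    return cleaned
```

A typical use would be to read the file with open(path), pass the file object to clean_fasta, and write the returned lines to a new file whose name contains no spaces.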

Is your species a 'Post-WGD' or a 'Non-WGD' species?

A whole-genome duplication (WGD) event occurred in yeasts about 100 million years ago (Wolfe and Shields, 1997). The pipeline needs to know whether your species is a post-WGD or a non-WGD species, in order to make a good annotation. Post-WGD species can have two genes per YGOB pillar, whereas non-WGD species can have only one gene per pillar. If you don't know the status of your species, try guessing that it is the same as its closest relatives.

[Figure: YGOB phylogenetic tree]

Frameshift correction methods

Next-generation genome sequences frequently contain insertion/deletion errors that cause frameshift errors in predicted gene structures, resulting in incorrect amino acid sequences and premature stop codons. YGAP’s annotation is based on TBLASTN searches, in which proteins from other well-annotated species are searched against the DNA sequence of the new genome. Frameshift errors in the new genome will result in TBLASTN hits that use two (or more) different reading frames.

YGAP offers two options for dealing with genes in which it detects apparent frameshift errors:

No correction. For genes that appear to contain a frameshift error, YGAP will assign a name and approximate location to the gene and report its DNA sequence, but the protein translation will be garbage. A list of these ‘probable frameshift’ genes is reported. This option does not change the Scaffolds sequence file.
Estimate. We recommend using this option if you do not have a Reads file. For genes that appear to contain a frameshift error, YGAP will assign a name to the gene, estimate the position of the frameshift error and insert one or two unknown nucleotides (N or NN) into the scaffold to fix the reading frame. Frameshift site estimation is based on the locations and frames of BLAST HSPs, and the N or NN nucleotides are preferentially added to mononucleotide tracts where possible. With this option, predicted protein translations are given for genes with frameshifts. This option will change the Scaffolds file: N or NN bases will be added at predicted frameshift sites, and no bases will be deleted. Users will need to download the modified Scaffolds file after running YGAP.

ATTENTION: Using frameshift correction changes the sequences of your scaffolds. Read this section carefully.

What does 'order scaffolds by size' do?

By default we order the scaffolds in decreasing order of size. The biggest scaffold is called chromosome 1 and genes on this scaffold are given names containing the letter A. The second-biggest scaffold is called chromosome 2 and its genes contain the letter B, and so on. If you deselect this option, the scaffolds will be named in the same order that they appear in the Scaffolds file you uploaded. See the mapping file.
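
The ordering described above can be illustrated with a short Python sketch. It is not YGAP's own code: the function name order_scaffolds is hypothetical, ties between scaffolds of equal length are assumed to keep their input order, and letters are only assigned when there are at most 26 scaffolds, because the two-letter scheme is not fully specified here.

```python
from string import ascii_uppercase


def order_scaffolds(scaffolds, sort_by_size=True):
    """Assign chromosome numbers and letters to scaffolds.

    scaffolds: list of (name, sequence) pairs in input-file order.
    Returns a list of (chromosome_number, letter_or_None, name).
    """
    items = list(scaffolds)
    if sort_by_size:
        # Largest first; list.sort is stable, so ties keep their input order.
        items.sort(key=lambda item: len(item[1]), reverse=True)
    # Single letters are only used when 26 or fewer scaffolds are present.
    use_letters = len(items) <= len(ascii_uppercase)
    return [
        (number, ascii_uppercase[number - 1] if use_letters else None, name)
        for number, (name, _seq) in enumerate(items, start=1)
    ]
```

With sort_by_size=False the scaffolds keep the order of the uploaded Scaffolds file, which corresponds to deselecting the option.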

What are the columns in the annotation file?

The columns are:

Column 1: Gene name.

Column 2: Strand.

Column 3: Lowest coordinate of the gene.

Column 4: Highest coordinate of the gene.

Column 5: YGOB browser status (all ON).

Column 6: Chromosome (scaffold) number.

Column 7: Gene short name.

Column 8: Gene coordinates (including introns).

Column 9: Name of orthologous Ancestral gene (if present).

Column 10: Name of homologous S. cerevisiae gene.

Column 11: Gene type.

Column 12: YGOB pillar number.

Column 13: Annotation tag (Cor / Fra / Man / NNN) – see below.

Column 14: Annotation route (simple / multi / GETORF / DOGS) – see below.

Annotation tags:

‘Cor’ means that the gene underwent successful automatic frameshift correction, resulting in a translatable sequence.

‘Fra’ means that the gene contains a probable uncorrected frameshift error. This could either be because YGAP was unable to correct the error, or because you did not select the frameshift correction option.

‘Man’ means that the gene has been flagged for manual attention. These are either untranslatable, or they are suspicious because of their length or because they overlap other genes.

‘NNN’ means that the coding region of the gene begins or ends in a run of NNN nucleotides (i.e. in a gap between two contigs in a scaffold).

Annotation routes:

Simple means that the gene was easily assigned to a YGOB pillar by YGAP.

Multi means that the gene is a member of a multigene family and its pillar assignment may be unreliable.

GETORF means that the gene was annotated only because it contains a large ORF.

DOGS means that the gene was annotated by the SearchDOGS step.
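
If you want to process the annotation file in a script, the 14 columns can be read into named fields. The Python snippet below is a minimal sketch that assumes the file is tab-delimited with one gene per line; that delimiter, the field names, and the helper names parse_annotation_line and count_tags are assumptions for illustration, not documented YGAP behavior.

```python
from collections import Counter
from typing import NamedTuple


class AnnotationRow(NamedTuple):
    gene_name: str           # column 1
    strand: str              # column 2
    start: int               # column 3, lowest coordinate
    end: int                 # column 4, highest coordinate
    ygob_status: str         # column 5
    chromosome: str          # column 6
    short_name: str          # column 7
    coordinates: str         # column 8, including introns
    ancestral_ortholog: str  # column 9
    scer_homolog: str        # column 10
    gene_type: str           # column 11
    pillar: str              # column 12
    tag: str                 # column 13: Cor / Fra / Man / NNN
    route: str               # column 14: simple / multi / GETORF / DOGS


def parse_annotation_line(line: str, sep: str = "\t") -> AnnotationRow:
    """Parse one line of the annotation file (delimiter is an assumption)."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, found {len(fields)}")
    fields[2] = int(fields[2])
    fields[3] = int(fields[3])
    return AnnotationRow(*fields)


def count_tags(rows) -> Counter:
    """Count how many genes carry each annotation tag."""
    return Counter(row.tag for row in rows)
```

Counting the tags gives a quick overview of how many genes were corrected ('Cor') and how many still need attention ('Fra', 'Man', 'NNN').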

What are the quality control lists?

To help you examine the quality of an annotation, YGAP generates several lists of genes in particular categories, including:

Ancestral genes with no ortholog in your genome: This is a list of the genes present in the ‘Ancestral’ yeast genome (see Gordon et al, PLOS Genetics 5:e1000485, 2009) that do not have any ortholog annotated in your genome. For a good genome sequence and annotation, this list should be fairly short and the listed pillars may require manual checking. Note that for a post-WGD genome, this check only requires the species to have 1 gene in the pillar.
Genes annotated only because they contain a large ORF: One of the final steps in the YGAP pipeline is to search for any large ORFs (> 150 amino acids) that do not overlap with other genes and were not otherwise annotated as genes. This output is a list of the genes annotated by this step. They are not assigned to any pillar in the mini-YGOB browser. This list may include artifacts such as fragments of genes that contain uncorrected frameshift errors, or pseudogenes, or highly divergent genes that should actually have been placed in a YGOB pillar.
Genes that may be very divergent from their orthologs: This is a list of genes that were detected only by SearchDOGS (OhEigeartaigh et al, BMC Genomics 12:377, 2011). SearchDOGS uses synteny and similarity to detect genes and is particularly effective for annotating small and highly divergent genes.
Genes in your genome that are singleton pillars in YGOB: This is a list of genes that are in singleton pillars in YGOB. These genes are not necessarily 'unique' as they can correspond to gene duplicates for which no homolog exists in a syntenic position in any of the YGOB species. However, this list should also contain members of any species-specific gene families, or genes that were gained by horizontal gene transfer.
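
The large-ORF step mentioned in the list above can be pictured with a simple six-frame scan. The Python snippet below is only a sketch of the general idea and is not YGAP's implementation: it assumes the standard stop codons, defines an ORF as running from the first ATG after a stop codon to the next in-frame stop, and does not check for overlap with previously annotated genes. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_aa):
    """Yield (start, end) of ATG..stop ORFs, 0-based half-open, on this strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                # (i - start) // 3 is the protein length, excluding the stop.
                if start is not None and (i - start) // 3 > min_aa:
                    yield start, i + 3
                start = None


def large_orfs(seq, min_aa=150):
    """Return (strand, start, end) tuples in forward-strand coordinates."""
    seq = seq.upper()
    length = len(seq)
    found = [("+", s, e) for s, e in orfs_on_strand(seq, min_aa)]
    for s, e in orfs_on_strand(reverse_complement(seq), min_aa):
        # Convert reverse-strand positions back to forward-strand coordinates.
        found.append(("-", length - e, length - s))
    return found
```

A real pipeline would additionally discard any ORF that overlaps a gene annotated by the earlier steps.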

Which genes have been automatically corrected?

When the appropriate option is selected, the pipeline tries to fix all the frameshifts it detects. Genes whose frameshifts can be corrected, and which can then be translated to a complete protein, are tagged as 'Cor' in the annotation file and are presented in this list. You can either accept those corrections or manually edit the scaffold yourself.

ATTENTION: A frameshift correction changes the scaffold sequence. If you manually edit the scaffold sequence to discard any of the suggested corrections, be aware that the coordinates of all subsequent annotated genes may then be wrong.
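
To see why downstream coordinates change, consider how each inserted N or NN moves every later position along the scaffold. The Python snippet below is an illustrative sketch, not YGAP output or code: the function name shift_coordinate is hypothetical, coordinates are taken to be 1-based, and each insertion is described as (position, n_bases), meaning n_bases were inserted immediately after that position of the original scaffold.

```python
def shift_coordinate(original_pos, insertions):
    """Map a 1-based original coordinate to the corrected scaffold.

    insertions: iterable of (position, n_bases) pairs, where n_bases Ns were
    inserted immediately after original position `position` (assumed convention).
    """
    # Every insertion located before this position pushes it to the right.
    return original_pos + sum(n for pos, n in insertions if pos < original_pos)
```

For example, with insertions of one N after position 100 and two Ns after position 250, original position 101 maps to 102 and original position 300 maps to 303. Removing a correction by hand reverses the corresponding shift for every gene downstream of it.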

Which genes need manual attention?

YGAP creates different sublists of genes that have been annotated automatically but need some manual attention.

Genes with an uncorrected frameshift: These genes, tagged as 'Fra' in the annotation file, have been identified as having a frameshift that was not corrected, either because the correction option was deselected or because the pipeline had insufficient data to fix it.
Genes with an untranslatable sequence: These genes, tagged as 'Man' in the annotation file, cannot be translated correctly, e.g. because they contain multiple stop codons.
Genes that are translatable but are suspicious because of their length or an overlap: These genes, also tagged as 'Man' in the annotation file, can be translated but need attention either because (1) they are unexpectedly short, by comparison to the homologs in their pillar, or (2) they overlap with another gene. The most probable cause of this situation is an erroneous start codon prediction, either in the gene itself or in a neighboring gene.
Genes that begin or end at an NNN sequence: These genes, tagged as 'NNN' in the annotation file, cannot be translated properly because they either start or stop in an undefined region (a run of N nucleotides due to a gap between contigs). Nevertheless, the rest of the gene appears to be OK.