This FAQ covers the most frequent questions. Please note, however, that this software is offered "as is", without any warranty or support.






You must provide a scaffolds file (a FASTA file containing the DNA sequences of all the scaffolds). The Contigs file and the Reads file are optional, but they are required for certain options as shown in the following table:

Options:
  • Check That Reads Support Scaffold Structure
  • Identify Contigs Excluded From Scaffolds
  • Annotate Genome, with sub-options: frameshift correction by estimation (inserting Ns); frameshift correction using reads

Files:
  • Scaffolds file
  • Contigs file
  • Reads file




The maximum size for uploaded files is 500 Megabytes. Your Scaffolds file should contain no more than 702 sequences. These limits should be enough for a yeast genome draft assembly.

YGAP is currently unable to handle Reads files from Illumina projects because they are too large. However, YGAP can still annotate Scaffolds files from Illumina projects including estimation of frameshift locations – it just inserts Ns to overcome frameshifts instead of consulting the Reads data.



Upload times depend on your Internet connection speed and the size of your files. Uploads over domestic broadband can be slow, and the connection can be lost if the upload takes longer than 1 hour. We recommend using a fast connection, such as a 100-megabit university connection, instead. If you are having difficulty uploading your files, please try running YGAP first without a Reads file, or contact us so that we can try to find a solution.



A genome tag is a four-character series of letters and/or numbers that forms the beginning of the gene names for your project. The tag is also used as a short name to identify the project itself.

An example of a genome tag is NCAS, which we used for the Naumovozyma castellii genome project. Genes from this genome have names like NCAS0A01230 and NCAS0E02150.

A genome tag can be any combination of letters and/or numbers, but it must be precisely 4 characters long.

Gene names are built from the genome tag. For example, the gene name NCAS0E02150 consists of these parts:
  • NCAS is the genome tag for the species Naumovozyma castellii.
  • 0 after the tag is a mandatory character (reserved for possible future use).
  • E indicates that this gene is on the 5th scaffold (E is the 5th letter of the alphabet). If you choose the option to sort your scaffolds by size, the letter A will be used for the largest scaffold, B for the second-largest, and so on in descending order of size. If you do not choose this option, the letters A, B, C, etc. will refer to the scaffolds in the order in which they appear in your input Scaffolds file. If there are more than 26 scaffolds in your Scaffolds file, YGAP will use two letters to indicate the scaffold, e.g. NCAS0AG01230; the maximum number of scaffolds allowed is 702 (= 26 + 26*26).
  • 02150 is a 5-digit number that identifies the gene. Genes are numbered sequentially, in increments of 10, from the beginning of each scaffold. The numbers assigned by YGAP increase in increments of 10 to allow room for future improvements of the annotation; for example if a new gene is discovered between NCAS0E02150 and NCAS0E02160 in the future it could be called NCAS0E02155.
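
The naming scheme above can be sketched in Python as follows. This is an illustration only, not YGAP's own code: the names `letters_to_index` and `parse_gene_name` are hypothetical, and the sketch assumes that two-letter scaffold codes continue as AA, AB, ... after Z, which this FAQ does not state explicitly.

```python
import re

# Tag (4 letters/digits), mandatory "0", 1-2 scaffold letters, 5-digit gene number.
GENE_NAME = re.compile(r"([A-Za-z0-9]{4})0([A-Z]{1,2})(\d{5})")


def letters_to_index(letters):
    """Convert a scaffold letter code to a 1-based scaffold index.

    Assumption: A..Z are scaffolds 1..26 and AA..ZZ are scaffolds 27..702.
    """
    if len(letters) == 1:
        return ord(letters) - ord("A") + 1
    first = ord(letters[0]) - ord("A")
    second = ord(letters[1]) - ord("A")
    return 26 + first * 26 + second + 1


def parse_gene_name(name):
    """Split a gene name such as NCAS0E02150 into its parts."""
    match = GENE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid gene name: {name!r}")
    tag, letters, number = match.groups()
    return {
        "tag": tag,
        "scaffold_letters": letters,
        "scaffold_index": letters_to_index(letters),
        "gene_number": int(number),
    }
```

For example, `parse_gene_name("NCAS0E02150")` reports the tag `NCAS`, scaffold index 5 and gene number 2150.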



The Scaffolds file should contain nucleotide sequences in FASTA format. Each sequence should correspond to a scaffold or a chromosome and should look similar to this:

>Scaffold_1
ATTTTAGATATATTCCTCAGAGATCTAGCTCGACGAGAGACTAGGACCACAGATGCAAGGGCTATTTATA
TCGACCAGCTACAACCGCATATATAGCTAGCTACCCAGAGACTTATAGATATTGGTTGTGACTCACGGAG
...
...
TATAGCATCTCCGCAGAGACTTATAGATATTGTTCGAGGAATT
>Scaffold_2
TGCAATTTAGTAATTCCTTTAGAAGCTCGAACCGAGAGACGATGAAATAGCTACTATCTATATACCCACG
TCACACGCTACAACCGCAGGGTACCCAGAGCAGGACTATCAGTGGTTGGCTGATGAACTTCAGAGACGCA
...
...
GCTCCTTAAATATCGTAGAGGCGTCTCTGAATATCTATAACAA
>Scaffold_702
.
.
.


Scaffolds should be long sequences corresponding approximately to complete chromosomes. Each sequence in the Scaffolds file should correspond to several contigs separated by runs of NNNNNN nucleotides.

ATTENTION: The total number of scaffolds should be less than or equal to 702.
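
Before uploading, you may want to check your Scaffolds file against the limits mentioned in this FAQ. The following Python sketch is one possible way to do that; the function names are hypothetical, and it assumes that "500 Megabytes" means 500 × 1024 × 1024 bytes, which this FAQ does not specify.

```python
import os

MAX_BYTES = 500 * 1024 * 1024  # assumption: 500 Megabytes interpreted as MiB
MAX_SCAFFOLDS = 702            # limit stated in this FAQ


def count_fasta_records(lines):
    """Count FASTA records by counting title lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def check_scaffolds_file(path):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than the 500 Megabyte upload limit")
    with open(path) as handle:
        n_records = count_fasta_records(handle)
    if n_records == 0:
        problems.append("no FASTA title lines ('>') were found")
    elif n_records > MAX_SCAFFOLDS:
        problems.append(f"{n_records} sequences found; the maximum is {MAX_SCAFFOLDS}")
    return problems
```

This only checks the file size and the number of sequences; it does not verify that the sequences themselves are valid nucleotide sequences.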



The Contigs file should contain nucleotide sequences in FASTA format. Each sequence should correspond to a contig and should look similar to this:

>7180000009539
TCGCCGTGCAAGCAGAGGCGTTCGAGGAATTTTTCAAGCAAGTTACCGTCTCGTAGGCAACGATACTCCT
CATTGACGATGCATGGCACATCTACGTAATCTCCGTCTTTAGTAATGTGGTGTGAAAAGTCTGTGCATGA
CAATTGTTCATAAGAGGGCAACGCAAGGCAAGTATTGTTCTGACTCTTCAGTTAGGCTTATTATGAAATA
TATTATGGTGGCGGTTTTTTCATAGGTTAAATAGTATAACTATGAAGATTAAAATTGTGAAGAACAAATC
AGAAGGCGTTAATTTCTACCCAAC
>7180000009540
AGTGAAATTAACCATTTGAACAATGGCCAAACACACCAATTAGTCACTAGTGTCCCTCCACCATTGTGTC
TTGATCCTTTGTCTCCACCATCCCTTCATGGTCCCACCTTCCATAATCGTCATACAACTGAAATAGAAGG
GCACCCCCAGGGGTGCTAAATGCCAATTGATCAACACTCACTTTGTAAAGCATGGAGAGGGTCTCTTGTT
ATGTTGAAACCGTAGTGGCAACTTCACGTTATTTTGTAGAATCTTGTACCATTTGTCTCCAATGAACGAG
AAAATCATTGACCCGTAAATAACAGCTCTAGAAGTTCTGTGCCAGTCGAAGTTCTTCTTCTTTTGCTCAA
TCTCATCACCTTGTTTTGGTTGTTCAGTAGGGGCAAATAGTATTTGGGCACTA
.
.
.




The Reads file should contain nucleotide sequences in FASTA format. Each sequence should correspond to a sequence read and should look similar to this:

>F3R6YUP03G7B7O.f
TGAAATCTAATATACCTTATT
>F3R6YUP03G7B7O.r
GATCACTGCTACAGCATTATGTCTAGTCTTTTTTCGGCATTTTAGCAAGTAAGCCTCGGAGTCTTTTACG
TTTATATTGTTTTGTTGTCTACCTTTAGGTTTGGCTCTTGGAATAGTAGACTCAATCGCTAATGTGTTCA
TAGGAAACCTCGTCATCCATAAAAACGAAATGATGGGAATTATGCATTCAGTTTATGGGGCTGCTGCAAT
GCTTACACCTCCATTGGTAGCTCATTTTTGTTGAGTGGGGACATTGGTCTCTATTTTTTTTGCTCCCCGT
AATAACATCATTCGTGGGAATGTGTTTTATCATACCTGCGTTTAGGTTTGAAACAGAGGCAAAATATGAC
TAT
>F3R6YUP03F8RIK.f
CTTTTTTATCATCATGGGTTCGGATGTAGTAAGGAAACGTGGAGTCTGAACTTTAGTATTTTCAAAATTG
GACTGATAATTGATG
>F3R6YUP03F8RIK.r
TGGTAAAGAGGGAATGATGCTAAACCT
.
.
.


The .f and .r suffixes indicate paired reads that are expected to be close together in the genome. If you choose the option “Check that reads support scaffold structure” (see below), these suffixes are used to identify the read-pairs. No other suffixes are supported.
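
As an illustration of how the .f and .r suffixes identify read-pairs, here is a minimal Python sketch. It is not YGAP's code, and the function name `pair_reads` is hypothetical; it takes read names (FASTA title lines without the leading '>') and reports which names have both mates present.

```python
def pair_reads(read_names):
    """Group read names into complete pairs using the .f / .r suffixes.

    Returns (paired, unpaired): sorted base names that have both mates,
    and sorted base names for which only one mate was seen.
    Names with any other suffix are ignored.
    """
    forward, reverse = set(), set()
    for name in read_names:
        base, dot, suffix = name.rpartition(".")
        if dot and suffix == "f":
            forward.add(base)
        elif dot and suffix == "r":
            reverse.add(base)
    paired = sorted(forward & reverse)
    unpaired = sorted(forward ^ reverse)
    return paired, unpaired
```

Running this on the title lines of your Reads file is a quick way to see how many reads would be ignored because their names do not follow the .f/.r convention.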



This option summarizes the evidence supporting the scaffold structure of the genome you uploaded. It generates a matrix showing the number of sequence read-pairs (.f and .r reads with the same name) that suggest a physical link between two contigs. Contigs that are neighbors in the scaffold should have many read-pair connections. Negative numbers in the matrix show pairs of reads that both map inside the same contig and so do not provide any information about scaffold structure. Only sequence reads that have a unique match in the genome are counted in this analysis. Check which files you need to run this option.
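
The idea behind the matrix can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline's implementation: the function name `link_counts` is hypothetical, the input is assumed to be a mapping from each uniquely matching read name to the contig it matches, and same-contig pairs are assumed to be recorded as negative counts on the diagonal.

```python
from collections import Counter


def link_counts(read_to_contig):
    """Count read-pairs linking contigs.

    read_to_contig maps read names ending in .f or .r to the contig in which
    the read has its unique match. Reads without a unique match are assumed
    to be absent from the mapping.
    """
    counts = Counter()
    for name, contig_f in read_to_contig.items():
        if not name.endswith(".f"):
            continue
        contig_r = read_to_contig.get(name[:-2] + ".r")
        if contig_r is None:
            continue  # mate is missing or has no unique match
        if contig_f == contig_r:
            # Both mates fall inside one contig: uninformative, counted negatively.
            counts[(contig_f, contig_f)] -= 1
        else:
            # Order-independent key so (c1, c2) and (c2, c1) are the same link.
            counts[tuple(sorted((contig_f, contig_r)))] += 1
    return counts
```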



This program looks for 'lonely contigs' that are in the Contigs file but are not included in any of the scaffolds in the Scaffolds file. We use the CAP3 assembler to try to merge these lonely contigs into larger contigs. If there are long sequences in these files, it may indicate a problem with the scaffold structure.



A whole-genome duplication (WGD) event occurred in yeasts about 100 million years ago (Wolfe and Shields, 1997). The pipeline needs to know whether your species is a post-WGD or a non-WGD species, in order to make a good annotation. Post-WGD species can have two genes per YGOB pillar, whereas non-WGD species can have only one gene per pillar. If you don’t know the status of your species, try guessing that it is the same as its closest relatives.

[Figure: YGOB phylogenetic tree]



Next-generation genome sequences frequently contain insertion/deletion errors that cause frameshift errors in predicted gene structures, resulting in incorrect amino acid sequences and premature stop codons. YGAP’s annotation is based on TBLASTN searches, in which proteins from other well-annotated species are searched against the DNA sequence of the new genome. Frameshift errors in the new genome will result in TBLASTN hits that use two (or more) different reading frames.

YGAP offers 3 options for dealing with genes in which it detects apparent frameshift errors:
  • No correction. For genes that appear to contain a frameshift error, YGAP will assign a name and approximate location to the gene and report its DNA sequence, but the protein translation will be garbage. A list of these ‘probable frameshift’ genes is reported. This option does not change the Scaffolds sequence file.
  • Estimate. We recommend using this option if you do not have a Reads file. For genes that appear to contain a frameshift error, YGAP will assign a name to the gene, estimate the position of the frameshift error and insert one or two unknown nucleotides (N or NN) into the scaffold to fix the reading frame. Frameshift site estimation is based on the locations and frames of BLAST HSPs, and the N or NN nucleotides are preferentially added to mononucleotide tracts where possible. With this option, predicted protein translations are given for genes with frameshifts. This option will change the Scaffolds file (N or NN bases will be added at predicted frameshift sites; no bases will be deleted). Users will need to download the modified Scaffolds file after running YGAP.
  • Reads. This is the most precise option and we recommend using it if you have a Reads file. For genes that appear to contain a frameshift error, YGAP will assign a name to the gene and estimate the position of the frameshift error. It will then search through the Reads file looking for primary sequence reads that differ from the scaffold sequence by insertion/deletion differences that can restore the correct reading frame. If suitable reads are found, YGAP will change the corresponding scaffold sequence in the Scaffolds file (inserting or deleting bases as necessary). This option will change the Scaffolds file (real bases will be added or deleted at predicted frameshift sites). Users will need to download the modified Scaffolds file after running YGAP.

ATTENTION: Using frameshift correction changes the sequences of your scaffolds. Read this section carefully.
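
To see why inserting N or NN is enough to restore a reading frame, consider the following Python sketch. It is a simplified illustration, not YGAP's algorithm: the function names are hypothetical, a frame is taken to be the start coordinate of an HSP modulo 3, and the insertion position is supplied by the caller rather than estimated.

```python
def ns_to_insert(upstream_frame, downstream_frame):
    """Return how many Ns (1 or 2) bring the downstream frame back into the upstream frame.

    Frames are 0, 1 or 2 (assumed here to be the start coordinate modulo 3).
    Inserting k bases shifts every downstream coordinate by k, so k must
    satisfy (downstream_frame + k) % 3 == upstream_frame.
    """
    k = (upstream_frame - downstream_frame) % 3
    if k == 0:
        raise ValueError("frames already agree; there is no frameshift to fix")
    return k


def insert_ns(scaffold, position, count):
    """Insert `count` N bases into the scaffold before the 0-based `position`."""
    return scaffold[:position] + "N" * count + scaffold[position:]
```

For example, the sequence ATGAACCCGGGTAA has lost one base from the codon after ATG, so the downstream codons start at coordinate 5 (frame 2) instead of frame 0. `ns_to_insert(0, 2)` gives 1, and `insert_ns("ATGAACCCGGGTAA", 5, 1)` yields ATGAANCCCGGGTAA, which reads as ATG AAN CCC GGG TAA. Note that every coordinate downstream of the insertion moves by the number of Ns added, which is why the modified Scaffolds file must be used together with the annotation.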



By default we order the scaffolds in decreasing order of size. The biggest scaffold is called chromosome 1 and genes on this scaffold are given names containing the letter A. The second-biggest scaffold is called chromosome 2 and its genes contain the letter B, and so on. If you deselect this option, the scaffolds will be named in the same order that they appear in the Scaffolds file you uploaded. See the mapping file.



The mapping file lists the correspondence between: your original scaffold names (i.e. the title lines in your Scaffolds fasta file), the chromosome numbers assigned by YGAP, and the letters used for gene naming.
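
The correspondence recorded in the mapping file can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the layout of the real mapping file is not described in this FAQ, and two-letter codes are assumed to continue as AA, AB, ... after Z.

```python
import string


def letter_code(index):
    """Letter code for a 1-based chromosome number (assumes 27 -> AA, 702 -> ZZ)."""
    letters = string.ascii_uppercase
    if index <= 26:
        return letters[index - 1]
    first, second = divmod(index - 27, 26)
    return letters[first] + letters[second]


def build_mapping(scaffolds, sort_by_size=True):
    """Return (original name, chromosome number, letter code) for each scaffold.

    scaffolds is a list of (name, sequence) tuples in the order of the input file.
    With sort_by_size=True the largest scaffold becomes chromosome 1 (letter A);
    otherwise the input order is kept.
    """
    if sort_by_size:
        ordered = sorted(scaffolds, key=lambda item: len(item[1]), reverse=True)
    else:
        ordered = list(scaffolds)
    return [
        (name, number, letter_code(number))
        for number, (name, _sequence) in enumerate(ordered, start=1)
    ]
```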



This program looks for 'lonely contigs' that are in the contigs file but are not included in any of the scaffolds in the scaffolds file. We use the CAP3 assembler to try to merge these lonely contigs into larger contigs. If there are long sequences in these files, it may indicate a problem with the scaffold structure.



This browser shows your genome, compared to three reference genomes: Saccharomyces cerevisiae, Ashbya gossypii and the inferred ancestral yeast genome. The mini-YGOB is intended to give you an indication of the annotation quality. If you like what you see, contact us and we can import your genome into the real YGOB.



If you selected a frameshift correction option (‘Reads’ or ‘Estimate’), YGAP probably made some automated corrections to your genome sequence. This file contains a modified version of your Scaffolds file with all those corrections.



The columns are:
  • Column 1: Gene name.
  • Column 2: Strand.
  • Column 3: Lowest coordinate of the gene.
  • Column 4: Highest coordinate of the gene.
  • Column 5: YGOB browser status (all ON).
  • Column 6: Chromosome (scaffold) number.
  • Column 7: Gene short name.
  • Column 8: Gene coordinates (including introns).
  • Column 9: Name of orthologous Ancestral gene (if present).
  • Column 10: Name of homologous S. cerevisiae gene.
  • Column 11: Gene type.
  • Column 12: YGOB pillar number.
  • Column 13: Annotation tag (Cor / Fra / Man / NNN) – see below.
  • Column 14: Annotation route (simple / multi / GETORF / DOGS) – see below.


Annotation tags:
  • ‘Cor’ means that the gene underwent successful automatic frameshift correction, resulting in a translatable sequence.
  • ‘Fra’ means that the gene contains a probable uncorrected frameshift error. This could either be because YGAP was unable to correct the error, or because you did not select the frameshift correction option.
  • ‘Man’ means that the gene has been flagged for manual attention. These are either untranslatable, or they are suspicious because of their length or because they overlap other genes.
  • ‘NNN’ means that the coding region of the gene begins or ends in a run of NNN nucleotides (i.e. in a gap between two contigs in a scaffold).


Annotation routes:
  • Simple means that the gene was easily assigned to a YGOB pillar by YGAP.
  • Multi means that the gene is a member of a multigene family and its pillar assignment may be unreliable.
  • GETORF means that the gene was annotated only because it contains a large ORF.
  • DOGS means that the gene was annotated by the SearchDOGS step.
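
If you want to process the annotation file with a script, the column layout above could be read along these lines. This is a sketch under an important assumption: it treats the file as tab-delimited text with one gene per line, which this FAQ does not state. The field names and function names are hypothetical labels for the 14 columns listed above.

```python
from collections import namedtuple

# Hypothetical field names, one per column, in the order listed above.
FIELDS = [
    "gene_name", "strand", "low_coord", "high_coord", "ygob_status",
    "chromosome", "short_name", "coordinates", "ancestral_ortholog",
    "scer_homolog", "gene_type", "pillar", "tag", "route",
]
AnnotationRow = namedtuple("AnnotationRow", FIELDS)


def parse_row(line, sep="\t"):
    """Parse one line into an AnnotationRow (assumes `sep`-delimited columns)."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, found {len(parts)}")
    return AnnotationRow(*parts)


def genes_needing_attention(rows):
    """Return names of genes tagged Fra, Man or NNN (column 13)."""
    return [row.gene_name for row in rows if row.tag in {"Fra", "Man", "NNN"}]
```

All values are kept as strings; convert coordinates to integers only once you have confirmed how they are written in your own file.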




    All tRNA gene annotations are generated using tRNAscan-SE (Lowe & Eddy, NAR 25:955, 1997) with default options.



    To help you examine the quality of an annotation, YGAP generates several lists of genes in particular categories, including:
    • Ancestral genes with no ortholog in your genome: This is a list of the genes present in the ‘Ancestral’ yeast genome (see Gordon et al, PLOS Genetics 5:e1000485, 2009) that do not have any ortholog annotated in your genome. For a good genome sequence and annotation, this list should be fairly short and the listed pillars may require manual checking. Note that for a post-WGD genome, this check only requires the species to have 1 gene in the pillar.
    • Genes annotated only because they contain a large ORF: One of the final steps in the YGAP pipeline is to search for any large ORFs (> 150 amino acids) that do not overlap with other genes and were not otherwise annotated as genes. This output is a list of the genes annotated by this step. They are not assigned to any pillar in the mini-YGOB browser. This list may include artifacts such as fragments of genes that contain uncorrected frameshift errors, or pseudogenes, or highly divergent genes that should actually have been placed in a YGOB pillar.
    • Genes that may be very divergent from their orthologs: This is a list of genes that were detected only by SearchDOGS (OhEigeartaigh et al, BMC Genomics 12:377, 2011). SearchDOGS uses synteny and similarity to detect genes and is particularly effective for annotating small and highly divergent genes.
    • Genes in your genome that are singleton pillars in YGOB: This is a list of genes that are in singleton pillars in YGOB. These genes are not necessarily 'unique' as they can correspond to gene duplicates for which no homolog exists in a syntenic position in any of the YGOB species. However, this list should also contain members of any species-specific gene families, or genes that were gained by horizontal gene transfer.
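
To make the "large ORF" category more concrete, here is a minimal Python sketch of a six-frame ORF scan. It is not the method YGAP uses: the function names are hypothetical, an ORF is assumed to run from the first ATG in a frame to the next stop codon, and the check that an ORF does not overlap other annotated genes is left out.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=150):
    """Yield (strand, start, end) for ATG-to-stop ORFs longer than min_aa amino acids.

    Coordinates are 0-based, end-exclusive, include the stop codon, and refer
    to the strand being scanned (for '-', the reverse-complemented sequence).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    # Number of codons before the stop = protein length.
                    if (i - start) // 3 > min_aa:
                        yield strand, start, i + 3
                    start = None
```

The default threshold of 150 amino acids matches the cut-off mentioned above; a smaller `min_aa` is useful only for trying the function on short test sequences.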




    When the appropriate option is selected, the pipeline tries to correct every frameshift it detects. Genes whose frameshift could be corrected, and whose corrected sequence can be translated into a complete protein, are tagged as 'Cor' in the annotation file and are presented in this list. You can either accept those corrections or manually edit the scaffold yourself.

    ATTENTION: A frameshift correction changes the scaffold sequence. If you manually edit the scaffold sequence to discard any of the suggested corrections, be aware that the coordinates of all subsequent annotated genes may no longer be correct.



    YGAP creates different sublists of genes that have been annotated automatically but need some manual attention.
    • Genes with an uncorrected frameshift: These genes, tagged as 'Fra' in the annotation file, have been identified as having a frameshift that was not corrected, either because the correction option was deselected or because the pipeline had insufficient data to fix it.
    • Genes with an untranslatable sequence: These genes, tagged as 'Man' in the annotation file, cannot be translated correctly, e.g. because they contain multiple stop codons.
    • Genes that are translatable but are suspicious because of their length or an overlap: These genes, also tagged as 'Man' in the annotation file, can be translated but need attention either because (1) they are unexpectedly short, by comparison to the homologs in their pillar, or (2) they overlap with another gene. The most probable cause of this situation is an erroneous start codon prediction, either in the gene itself or in a neighboring gene.
    • Genes that begin or end at an NNN sequence: These genes, tagged as 'NNN' in the annotation file, cannot be translated properly because they either start or stop in an undefined region (a run of N nucleotides due to a gap between contigs). Nevertheless, the rest of the gene appears to be OK.
