SearchDOGS Bacteria README/User's Guide:


SearchDOGS Bacteria is designed to identify missed genes in the annotations of bacterial species and strains. 
The software identifies ORFs likely to represent bona fide genes through a combination of sequence similarity searches and the identification of conserved synteny. 
It is freely downloadable. For more information, please see PUBLICATION:[Link to paper when published]


Contents:

1.	Installation
2.	Genome inputs
3.	Results
4.	FAQ

------------------------------------------

1. Installation:

SearchDOGS Bacteria is designed to run on a Linux system or a Linux-like environment. 
It is a Perl program made up of five Perl scripts (Launch_SearchDOGS.pl, Create_structures.pl, Find_ORFs.pl, Create_output.pl and copygaps.pl). 
These need to be included in the same folder. The program is run using Launch_searchDOGS.pl.

Necessary components:

The SearchDOGS Bacteria script set (downloadable from http://wolfe.ucd.ie) 

Perl (downloadable from http://www.perl.org/get.html)

BLAST (SearchDOGS Bacteria is currently designed to work with standard or Legacy BLAST rather than BLAST+)  (1) 
(http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)

Bio::Perl (specifically Bio::Seq and Bio::IO) (2) 
(http://www.bioperl.org/wiki/Main_Page)

The ClustalO (Clustal Omega) multiple sequence alignment program (3)  
(http://www.ebi.ac.uk/Tools/msa/clustalw2/#)

The PAML sequence evolution software suite (4)  
(http://abacus.gene.ucl.ac.uk/software/paml.html)

If you have Ubuntu:

BioPerl, BLAST and ClustalO can be installed as packages (using the Synaptic package manager or apt-get), but PAML must be installed separately from the developer's website. 
*See our special directions for PAML at the end of this document (warning: the instructions provided by the developer have a small fault).


It is recommended that the final executables be placed in /usr/bin/
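
Before a first run it can help to confirm that the required programs are visible on the PATH. The following is a minimal sketch in Python, not part of SearchDOGS itself. The executable names are assumptions: "blastall" and "formatdb" are the usual Legacy BLAST programs, "clustalo" is the usual Clustal Omega binary and "yn00" is the PAML program used for the Ka/Ks test; the README does not state which names the scripts actually call.

```python
import shutil

# Assumed executable names (see the note above); adjust to your installation.
REQUIRED = ["perl", "blastall", "formatdb", "clustalo", "yn00"]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(REQUIRED)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

The "which" argument exists only so that the lookup can be replaced in a test; in normal use the default (shutil.which) is what you want.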

------------------------------------------

2. Genome Inputs:

SearchDOGS is designed to work on sets of two or more genomes. 
For analysis, the genomes are required to be in standard GenBank (.gb) format; examples can be downloaded from resources such as http://www.ncbi.nlm.nih.gov/genome. 
Genomes to be analysed should be placed in a subfolder called "Genomes" within the SearchDOGS Bacteria folder. 
In this zip file, the folder "Genomes" contains three genomes to test the software - E. coli K12 (ECKT), Salmonella enterica (SETY), E. coli HS (ECHS).
Please delete or move these .gb files out of "Genomes" if you do not wish the software to use them.
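
As a quick sanity check of the input folder, the sketch below lists the .gb files it contains and confirms that there are at least two, since SearchDOGS works on sets of two or more genomes. It is an illustration only; the file names used in the example call are hypothetical.

```python
import os


def genbank_files(names):
    """Return, in sorted order, the file names that end in .gb."""
    return sorted(name for name in names if name.endswith(".gb"))


def enough_genomes(names, minimum=2):
    """True when at least `minimum` GenBank files are present."""
    return len(genbank_files(names)) >= minimum


if __name__ == "__main__":
    folder = "Genomes"  # subfolder of the SearchDOGS Bacteria folder
    if os.path.isdir(folder):
        found = genbank_files(os.listdir(folder))
        print(len(found), "GenBank file(s):", ", ".join(found))
        if not enough_genomes(found):
            print("At least two genomes are needed.")
    else:
        print("No folder called", folder, "in the current directory.")
```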

------------------------------------------

3. Results:

SearchDOGS produces outputs in the form of html pages, stored in an automatically-produced folder, Results. This folder contains the following:

-An overview html page, SPECIES_DATE.html (where DATE is the date on which the software was run) 
-A page of results for each species/strain included, Results_GENOME_DATE.html (where GENOME is the 4-letter tag chosen by the user to identify the genome).


The Results html pages provide a list of all the genomic locations at which the SearchDOGS Bacteria software has identified a hit at a conserved genomic location. 
Please note that only a subset of ORFs identified are likely to correspond to genuine unannotated genes. 
The software is designed to avoid false negatives while requiring the user to use the information provided to reject false positives. 

ORFs identified by SearchDOGS take the form:
GeneA-GeneB
Where gene A and gene B are identified by (4-letter species tag)_(Protein id)
e.g.
ECSA_NP_313245.1-ECSA_NP_313246.1
This indicates that an unannotated ORF exists in the intergenic region between gene A and gene B.
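
For anyone post-processing the results, the identifier can be split back into its two flanking genes. The sketch below is a hedged illustration, not code from SearchDOGS. It assumes that species tags are four upper-case letters and that protein ids contain no hyphen, and it also accepts the optional _ORFA, _ORFB suffix described in the FAQ section.

```python
import re

# (4-letter species tag)_(protein id) for each flanking gene, joined by "-".
ORF_ID = re.compile(
    r"^([A-Z]{4})_([^-]+)"   # left flanking gene (gene A)
    r"-([A-Z]{4})_([^-]+?)"  # right flanking gene (gene B)
    r"(?:_ORF([A-Z]+))?$"    # optional suffix used for additional ORFs
)


def parse_orf_id(orf_id):
    """Split an ORF identifier into flanking genes and optional ORF letter."""
    match = ORF_ID.match(orf_id)
    if match is None:
        raise ValueError(f"not a GeneA-GeneB identifier: {orf_id!r}")
    tag_a, pid_a, tag_b, pid_b, extra = match.groups()
    return {"left": (tag_a, pid_a), "right": (tag_b, pid_b), "orf": extra}


print(parse_orf_id("ECSA_NP_313245.1-ECSA_NP_313246.1"))
```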


In order to provide the user with the necessary information to retain the hit as a candidate missing gene or discard as a spurious hit/pseudogene, a number of columns of data are provided.

(i): 	Pillar: Describes the syntenic neighbourhood of the BLAST hit from the perspective of the species containing the hit. 
	Genomic segments highlighted in red are those whose intergenic region hit proteins of a syntenic ortholog pillar in a BLASTX search and therefore contain a candidate gene. 
	Each row names the genes flanking this pillar in each species if their syntenic context is conserved.
	The arrows indicate the orientation of the genes (from the relative "perspective" of the species/strain in which the ORF has been found).
	A negative distance (nt) indicates an overlap between the ORF and the adjacent gene.
	
e.g. 	adjacent left	Distance(nt) to adjacent gene	pillar hit (8721)		Distance(nt) to adjacent gene	adjacent right
	ECKT_AAC74864.2 ->	3		ECKT_ACO59995.1 ->				59		ECKT_AAC74865.1 <-
	ECSA_NP_310530.1 ->	206		ECSA_NP_310530.1-ECSA_NP_310531.1 ->		59		ECSA_NP_310531.1 <-
	ECSE_YP_002391574.1 ->	29		ECSE_YP_002391574.1-ECSE_YP_002391575.1 ->	59		ECSE_YP_002391575.1 <-

(ii): 	Blast results: Two full BLAST results are provided. 
	GeneA-GeneB:
	The first hyperlink links to the result of a BLASTP search using the protein sequence of the candidate ORF (start to stop) against a database of the other orthologs making up the corresponding pillar. 
	- a genuine protein is expected to hit its orthologs unless the locus is highly diverged. 
	
	Original BLASTX result:
	A BLASTX result of the intergenic region searched against the protein sequences of the pillar orthologs is also provided. 
	- these are the results of the initial BLAST search used by SearchDOGS to identify the candidate genes. 

(iii): 	Hit by in TBLASTN: The protein sequences of the genes in the homology pillar are used as queries in a TBLASTN search against the nucleotide sequence of the candidate ORF
	- genes that hit the candidate are listed.

(iv): 	Coordinates: Stop-to-stop and start-to-stop coordinates are provided for each ORF, as well as information on whether a frameshift correction was required to make a full-length ORF in a single frame. 
	In cases where alternative (non-ATG) start codons are annotated for pillar orthologs, the start of the HSP generated in the BLASTX search is used as an approximate start point for the start-to-stop gene 
	- user analysis will be required to identify the most likely start codon. 

e.g.	Candidate stop to stop ORF:
	4743876..4744001 (link to nucleotide sequence provided)

	Start to stop ORF
	4743951..4744001

	Intergenic Sequence:
	4743865..4744003 (link to nucleotide sequence provided)

	Correction to make full-length ORF?
	None
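
The lengths implied by these coordinates can be checked with a few lines of Python. This is a sketch only; it assumes the coordinates are inclusive at both ends, as in GenBank feature locations. The README does not say whether the stop codon is counted in the reported amino acid length, so the code reports codons rather than amino acids.

```python
def span_length(coords):
    """Length in nucleotides of an inclusive 'start..end' range."""
    start, end = (int(part) for part in coords.split(".."))
    return end - start + 1


def codon_count(coords):
    """Number of whole codons in the range (length must be a multiple of 3)."""
    length = span_length(coords)
    if length % 3 != 0:
        raise ValueError(f"{coords} is not a whole number of codons")
    return length // 3


for label, coords in [
    ("stop to stop", "4743876..4744001"),
    ("start to stop", "4743951..4744001"),
]:
    print(label, span_length(coords), "nt,", codon_count(coords), "codons")
```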


(v): 	Ka/Ks: A Ka/Ks test is conducted using yn00 from the PAML suite for each candidate in order to provide evidence of protein sequence-level conservation. 
	The Ka/Ks value provided on the main page is an average of the Ka/Ks ratio scores obtained for the protein translation of the start-to-stop ORF against each of the orthologs in the pillar. 
	The largest standard start-to-stop ORF within the stop-to-stop ORF is used in this test provided no genes with non-consensus start codons exist within the corresponding ortholog pillar
	- otherwise only the length of ORF from the start of the HSP to the end of the ORF is used.

e.g.	Omega average: 0.4358 details (link to detailed results)

	Clicking the hyperlink brings up a more detailed breakdown of Ka/Ks results. 
	Individual Ka/Ks values for the pairwise comparisons of the protein sequence of the candidate gene and each known ortholog are provided, each with a 95% confidence interval to indicate the statistical significance of the result.
	For more information, see O hEigeartaigh et al (2014, forthcoming), and for background information on Ka/Ks tests see Yang (5).
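
The averaging step described above can be sketched as follows. The pairwise values in the example are hypothetical and are not taken from any SearchDOGS run; the function simply takes the arithmetic mean of the pairwise Ka/Ks (omega) values, which is how the text describes the figure shown on the main page. A Ka/Ks ratio below 1 is conventionally read as a sign of purifying selection; no particular cut-off used by the software is implied here.

```python
from statistics import mean


def omega_average(pairwise_omegas):
    """Arithmetic mean of the pairwise Ka/Ks values for one candidate."""
    if not pairwise_omegas:
        raise ValueError("at least one pairwise comparison is needed")
    return mean(pairwise_omegas)


def below_neutral(omega):
    """True when Ka/Ks is below 1, the conventional neutral expectation."""
    return omega < 1.0


# Hypothetical pairwise values against three orthologs in a pillar.
example = [0.40, 0.50, 0.45]
average = omega_average(example)
print(f"Omega average: {average:.4f}", "(below 1)" if below_neutral(average) else "")
```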

(vi): 	Annotated Feature: Information on whether an annotated feature (pseudogene, tRNA, rRNA) exists in the intergenic region containing the candidate gene is extracted from the GenBank file. 

e.g.	tRNA
	complement(4450439..4450515)

(vii): 	Start to Stop length (amino acids). 
	Note: In cases where alternative (non-ATG) start codons are annotated for pillar orthologs, the start of the HSP generated in the BLASTX search is used as an approximate start point for the start-to-stop gene
	- the ? indicates the approximate nature of the length.
	
(viii): Percentage of pillar median. 

(ix): 	Homolog Lengths: Amino acid lengths of proteins coded by predicted (P) and known (K) genes at the locus at which a hit is identified. 
	A candidate that is far longer or shorter than the other homologs at the locus is unlikely to be real/intact.

e.g.	YPAN	ECKT	SBOY	PSYR	SETY	ECSA	VCHO	ECSE	XCAM
		K 320	K 320		K 324	P 321		K 320	
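
Columns (viii) and (ix) can be related with a small calculation. The sketch below rests on an assumption, since the README does not define column (viii): it takes "percentage of pillar median" to mean the candidate's length in amino acids as a percentage of the median length of the known proteins in the pillar. The numbers are the K and P lengths from the example above.

```python
from statistics import median


def percent_of_pillar_median(candidate_length, ortholog_lengths):
    """Candidate length as a percentage of the median ortholog length.

    Assumed interpretation of column (viii); see the note above.
    """
    return 100.0 * candidate_length / median(ortholog_lengths)


known_lengths = [320, 320, 324, 320]  # the K entries in the example
predicted_length = 321                # the P entry in the example

value = percent_of_pillar_median(predicted_length, known_lengths)
print(f"{value:.1f}% of pillar median")
```

A value close to 100 means the candidate is about as long as its homologs; a value far from 100 points to the situation described in (ix), where the candidate is unlikely to be real or intact.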

(x):	 Amino acid sequences: AA sequences coded for by the stop-to-stop ORF identified by SearchDOGS and the annotated genes in the corresponding ortholog pillar. 
	Start codons are highlighted in green, stop codons in red. 
	In instances where the corresponding pillar contains a homolog with a non-ATG start codon, alternative start codons are highlighted in purple, orange, blue, black and red in the ORF.

e.g.	>ECSA_NP_310697.1-ECSA_NP_310702.1
	*hqygivMFTIKTDDLTHPAVQALVAYHISGMLQQSPPESSHALDVQKLRNPTVTFWSVW
	EGEQLAGIGALKLLDDKHGELKSMRTAPNYLRRGVASLILRHILQVAHDRCLHRLSLETG
	TQAGFTACHQLYLKHGFVDCEPFADYQLDPHSRFLSLTLCEDNELl
	>ECKT_AAC74999.1
	MFTIKTDDLTHPAVQALVAYHISGMLQQSPPESSHALDVQKLRNPTVTFWSVWEGEQLAG
	IGALKLLDDKHGELKSMRTAPNYLRRGVASLILRHILQVAQDRCLHRLSLETGTQAGFTA
	CHQLYLKHGFADCEPFADYRLDPHSRFLSLTLCENNELP*

----------------------------------------------

4. Notes and Frequently Asked Questions:

(i): 	GeneA-GeneB_ORFA
	In some instances, two or more different ORFs will be identified in the same genomic segment. 
	These may have sequence similarity to different homology pillars. 
	In instances with multiple ORFs within one genomic segment, the following identification system is used:
	GeneA-GeneB, GeneA-GeneB_ORFA, GeneA-GeneB_ORFB, GeneA-GeneB_ORFC 

e.g.	ECHS_YP_001457752.1-ECHS_YP_001457753.1, ECHS_YP_001457752.1-ECHS_YP_001457753.1_ORFA, ECHS_YP_001457752.1-ECHS_YP_001457753.1_ORFB


(ii): 	Pillar_altA	
	In some instances, two or more ORFs will be identified with sequence similarity to the same homology pillar, and with syntenic support that suggests they may belong in this homology pillar.
	This may be due to a gene split or duplication, or alternatively, one of the hits may be spurious.
	It is up to the user to analyse the information provided to draw the appropriate conclusion.
	In these cases, the pillar containing the first identified ORF with homology and syntenic support is labelled pillar X (where X is the pillar number).
	For subsequent "hits" to the same pillar, the alternative pillar (containing the candidate ORF) will be labelled pillar X_altA, X_altB etc.

e.g.	adjacent left	Distance(nt) to adjacent gene	pillar hit (5081_altA)	Distance(nt) to adjacent gene	adjacent right
 					9		ECKT_AAC74109.1 <-				587	ECKT_AAC74110.1 ->
	ECHS_YP_001457867.1 <-		51		ECHS_YP_001457867.1-ECHS_YP_001457868.1 <-	506	ECHS_YP_001457868.1 ->
				 	 	 	 							VCHO_YP_001216931.1 ->
	
	Note: alternative ECHS candidate exists for this location, see pillar 5081

-----------------------------------------------------------------------------------------

PAML instructions for linux (note, same as online instructions but the line with the correction is starred):

UNIX, linux, and other systems: 
Download the Win32 archive and save and unpack it into a local folder. 
Remove the Windows executables (.exe files) in the bin/ folder. 
(Replace 4.6 with the appropriate version number in the following commands.)

tar xzf paml4.6.tgz

Then cd to the paml folder (you have to remember where you saved the files) and again cd to the src/ folder and compile the programs.

rm bin/*.exe 
cd src 
make -f Makefile 
ls -lF 
rm *.o 
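mv baseml basemlg codeml pamp evolver yn00 chi2 /usr/bin 
*** Corrected line above: the online instructions give ../bin as the destination; /usr/bin is used here instead (root privileges may be needed to write there). ***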
cd .. 
ls -lF bin 
bin/baseml 
bin/codeml 
bin/evolver 

Set up a folder of local programs and change your initialization file for the shell. 
You need to do this for your user account only once. First check that there is a bin/ folder inside your account. 
If not, create one.

cd 
mkdir bin 

Then modify your path to include the bin/ folder in the initialization file for the shell. 
You can use more /etc/passwd to see which shell you run. Below are notes for the C shell and bash shell. 
There are other shells, but these two are commonly used.

(1) If you see /bin/csh for your account in the /etc/passwd file, you are running the C shell, and the initialization file is .cshrc in your home folder. 
You can use more .cshrc to see its content if it is present. 
Use a text editor (such as emacs, vi, SimpleText, etc.) to edit (or create, if one does not exist) the file, by something like

emacs .cshrc 
and insert the following line
set path = ($path . ~/bin) 

The different fields are separated by spaces. 
Here '.' means the current folder, ~/ means your home folder, ~/bin means the bin folder you created, and $path is whatever folders are already in the path.

(2) If you see /bin/bash in the file /etc/passwd for your account, you are running the bash shell, and the initialization file is .bashrc. 
Use a text editor to open .bashrc and insert the following line
PATH=$PATH:./:~/bin/ 

This changes the environment variable PATH. 
The different fields are separated by colon : and not space. 
If the file does not exist, create one.
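
Once a new shell has been started, the change can be verified by checking that the bin folder really is in PATH. The sketch below is an illustration only and makes one assumption: entries are compared after expanding "~" and removing a trailing slash, so "~/bin/" and the full path to the same folder are treated as equal.

```python
import os


def path_contains(path_value, folder, sep=os.pathsep):
    """True when `folder` is one of the entries of a PATH-style string."""
    def normalise(entry):
        return os.path.normpath(os.path.expanduser(entry))

    target = normalise(folder)
    return any(normalise(entry) == target for entry in path_value.split(sep))


if __name__ == "__main__":
    current = os.environ.get("PATH", "")
    print("~/bin in PATH:", path_contains(current, "~/bin"))
```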

After you have changed and saved the initialization file, every time you start a new shell, the path is automatically set for you. 
You can then cd to the folder which contains your data files and run paml programs there. 
The following moves to the paml folder (suppose you have extracted the archive into Programs/paml4.6/ on your account) and runs the program using the default files.
cd 
cd Programs/paml4.6 
codeml 

As the path is set up properly, this is equivalent to
~/bin/codeml 

Note that Windows uses \ while Unix uses /, and Windows is case-insensitive while Unix is case-sensitive.


References:

1.	Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. 
	Basic local alignment search tool. J Mol Biol 215:403-410.
2.	Stajich, J. E., D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, 
	C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson, and E. Birney. 2002. 
	The Bioperl toolkit: Perl modules for the life sciences. Genome Research 12:1611-1618.
3.	Sievers, F., A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J. D. Thompson, and D. G. Higgins. 2011. 
	Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7:539.
4.	Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555-556.
5.	Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586-1591.

