Scientific Background

The complete scientific background and working of the AppIndels server is described in "AppIndels.com Server: A Web Based Tool for Identification of Known Taxon-Specific Conserved Signature Indels in Genome Sequences and Its Applications for Identification of Unclassified Bacillus spp."

Please consult our paper for additional information.


Genome sequence information is an invaluable tool for grouping prokaryotic species and constructing phylogenetic trees based on genes/proteins found at different taxonomic levels. Several methods exist for this purpose, such as average nucleotide identity (ANI), average amino-acid identity (AAI), digital DNA-DNA hybridization, and 16S rRNA similarity values. Additionally, genome sequence information is useful for systematically delineating prokaryotic taxa within more specific groups (e.g. genus, family) based on shared genotypic, phenotypic, or other properties. However, there are currently no reliable means for the demarcation of genus-level groupings, as the aforementioned methods are often unable to correctly distinguish between genera.

Previous studies have identified numerous conserved signature indels (CSIs), which are highly specific molecular markers uniquely shared by monophyletic groups of organisms. When present in expressed regions, these have proven to be very useful markers for evolutionary and taxonomic studies. They result from rare genetic changes, and hence their shared presence within a specific group of organisms suggests that the change occurred in a common ancestor. For our purposes, CSIs are of fixed lengths and are present at specific positions in particular genes/proteins; additionally, they are flanked on both sides by conserved regions to ensure that the genetic changes they represent are reliable molecular characteristics.

 

These CSIs provide strong evidence supporting the common ancestry and evolutionary relatedness of species from specific clades, and they have been shown to exhibit strong predictive ability for newly identified or sequenced members of a clade. The AppIndels tool seeks to provide a convenient method for using CSIs in research, which can help to (among other applications):

· Supplement other systematic methods to determine the taxonomy of prokaryotic organisms;
· Reliably assign newly described species into different genera; and
· Identify and classify unknown strains or species into known genera in conjunction with other methods.


How It Works

The AppIndels server relies on a database of previously identified CSIs. Each CSI entry includes an amino acid sequence, indel characteristics (insertion or deletion, length, and position in the sequence), its taxon specificity information, and a manually assigned relative weight. The weight is generally 0.3 to 0.5, depending on the number of CSIs identified for a given taxon. The weighting serves to enhance the specificity of identifications by excluding results from isolated CSIs shared by unrelated species.
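To make the shape of a database entry concrete, the following is a minimal sketch in Python of how one CSI record could be represented. It is not the server's actual schema: the class name, the field names, and the example values (including the sequence and the taxon label) are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CsiEntry:
    """One illustrative CSI record; all field names are assumptions."""
    csi_id: str
    taxon: str            # taxon for which the CSI is specific
    query_sequence: str   # amino acid sequence containing the indel and its flanks
    indel_type: str       # "insertion" or "deletion"
    indel_length: int     # length of the indel in residues
    indel_position: int   # 0-based offset of the indel within query_sequence
    weight: float         # manually assigned relative weight

    def __post_init__(self):
        # Basic sanity checks on the fields described in the text.
        if self.indel_type not in ("insertion", "deletion"):
            raise ValueError("indel_type must be 'insertion' or 'deletion'")
        if self.indel_length <= 0:
            raise ValueError("indel_length must be positive")
        if not 0 <= self.indel_position <= len(self.query_sequence):
            raise ValueError("indel_position must fall within query_sequence")
        if self.weight <= 0:
            raise ValueError("weight must be positive")


# Placeholder values only; this is not a real CSI.
example = CsiEntry(
    csi_id="csi_0001",
    taxon="taxon_A",
    query_sequence="KTLLAVGGSAAPQR",
    indel_type="insertion",
    indel_length=2,
    indel_position=6,
    weight=0.5,
)
```

The record is frozen because entries of this kind would be curated reference data that a search should read but never modify.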

 

When a genome is uploaded, the server runs BLASTp searches against all sequences in the database. Hits are then validated to ensure that the indel is present in the same location as specified in the database and is flanked by at least five conserved residues on both sides. The weights of all validated hits for a taxon are summed, and if the total exceeds the threshold of 1.0 then the genome is positively identified for that taxon. If insufficient hits are found to meet the threshold, the server does not display any result, to avoid anomalous or misleading identifications. For a more complete explanation of our process, please see Gupta & Kanter Eivin (2023).
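The validation and scoring steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the server's implementation: the function names and the taxon labels are hypothetical, hits are assumed to arrive as pairwise alignments with "-" marking gaps, and "conserved" is approximated here as exact identity in the flanking columns. Only the threshold of 1.0 and the minimum flank of five residues come from the text.

```python
from collections import defaultdict

THRESHOLD = 1.0  # identification threshold stated in the text
MIN_FLANK = 5    # minimum conserved residues required on each side


def flanks_conserved(aligned_ref, aligned_hit, indel_start, indel_end,
                     min_flank=MIN_FLANK):
    """Check the flanks of an indel occupying columns [indel_start, indel_end).

    Assumption: both strings are rows of one pairwise alignment (equal
    length), and a flank counts as conserved only if every column is an
    identical, ungapped residue.
    """
    if len(aligned_ref) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if indel_start - min_flank < 0 or indel_end + min_flank > len(aligned_ref):
        return False  # not enough aligned residues on one side
    columns = list(range(indel_start - min_flank, indel_start))
    columns += list(range(indel_end, indel_end + min_flank))
    return all(aligned_ref[i] == aligned_hit[i] != "-" for i in columns)


def identify_taxa(validated_hits, threshold=THRESHOLD):
    """Sum the weights of validated hits per taxon and keep taxa over threshold.

    validated_hits: iterable of (taxon, weight) pairs for hits that have
    already passed the position and flank checks.
    """
    totals = defaultdict(float)
    for taxon, weight in validated_hits:
        totals[taxon] += weight
    # "Exceeds" is read strictly, so a total of exactly 1.0 is not reported.
    return {taxon: total for taxon, total in totals.items() if total > threshold}


# Hypothetical hits: three CSIs for taxon_A, one isolated CSI for taxon_B.
hits = [("taxon_A", 0.5), ("taxon_A", 0.5), ("taxon_A", 0.3), ("taxon_B", 0.4)]
identified = identify_taxa(hits)  # only taxon_A passes; taxon_B is dropped
```

Under this strict reading, and with weights in the usual 0.3 to 0.5 range, at least three validated CSIs for the same taxon are needed before an identification is reported, which is how an isolated CSI shared by an unrelated species is kept out of the results.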

Limitations

While the AppIndels server can be a very useful tool for genome analysis, there are several important limitations to be aware of concerning its use.

Presence of CSIs in the Database

Since the central aspect of the server is the CSI database used for BLASTp searches, its functionality depends on having CSI information for the taxa in question. We hope to continually expand the database to include CSIs from our other works and from user submissions, but since there is an extensive manual review process, this may take time. As such, we maintain a list (below) of the currently supported taxa. Uploading genomes which do not fit within supported taxa may result in an unreliable identification or no identification at all. Generally, the presence of CSIs for a particular group (a positive result) is highly significant, but a negative result holds little significance, as it may simply reflect a lack of CSIs for the particular taxon (see “Interpreting Your Result”).

Click here to view the supported taxa and the number of CSIs corresponding to each taxon.

If you wish to contribute additional CSIs to the database, please review the information within our paper and send all information to the corresponding author. Extensive validation is required, and you are responsible for demonstrating the specificity and conservation of the CSI; additionally, given the labor-intensive nature of vetting CSIs and our high standards for inclusion in the database, we may not include CSIs submitted via email.


Genome Quality

Generally speaking, the server will identify any CSIs which are included in a provided sequence (as long as they meet the threshold) and allow the user to interpret the results in conjunction with other findings (see “Interpreting Your Result”). As such, it is the user’s responsibility to ensure that submitted genomes are of sufficient quality. If the sequences corresponding to CSIs are in omitted parts of a partial genome sequence, the server will be unable to detect them and may therefore not produce a result (or may not include all CSIs which would be present in the full sequence). Additionally, if an uploaded genome is contaminated, the server may make more than one identification (for the correct strain and the contaminating strain(s)) and will display all CSIs found. As such, users should corroborate results with other methods and use due diligence when interpreting them.


Changes to Classification

Microbial taxonomy is a dynamic and fast-changing discipline, and new species related to different taxa are continually described. This can lead to divisions or unification within existing taxa, and thus CSIs that were previously specific for a given taxon may no longer be specific for the newly described taxon. Changes in classification generally do not invalidate the clade specificity of the described CSIs, but they may alter the taxonomic name or rank of the clade for which the CSIs are specific.