GeneMark-ES/ET has been used to annotate thousands of eukaryotic genomes sequenced since 1997.
NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods that predict protein-coding genes, RNA genes, and other functional elements directly from sequence. The Prokaryotic Genome Annotation Pipeline (PGAP) determines structural annotation by comparing open reading frames (ORFs) to libraries of protein hidden Markov models (HMMs), representative RefSeq proteins, and proteins from well-characterized reference genomes. GeneMarkS-2+ then makes ab initio coding region predictions for genomic regions that lack HMM or protein evidence and selects start sites for ORFs whose evidence comes from HMMs.
Structural annotation
Proteins
ORFs are predicted by ORFfinder in all six frames of the genome and searched against a library of HMMs (TIGRFAM, Pfam, PRK HMMs, and NCBIfams, a collection for high-value protein families, including proteins involved in antimicrobial resistance). Short ORFs without HMM hits that overlap ORFs with hits are dropped. The remaining translated ORFs are searched against BlastRules, proteins from lineage-specific reference genomes, and protein cluster representatives, using BLAST followed by ProSplign, which aligns proteins even in the presence of frameshifts. HMM hits and protein alignments are then mapped from the ORFs to the genome. The final set of predicted proteins is based on this alignment evidence, supplemented in regions that lack protein alignment evidence by predictions from the ab initio gene-finding program GeneMarkS-2+.
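The six-frame scan in the first step can be illustrated with a minimal sketch. This is not ORFfinder's implementation: it assumes ORFs run from an ATG to the next in-frame stop codon and uses only the three standard stop codons, and the names `find_orfs` and `min_len` are hypothetical.

```python
# Illustrative six-frame ORF scan (assumption: ATG start, standard stops only).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=6):
    """Return (strand, frame, start, end) for each ATG..stop ORF.

    Coordinates are 0-based and end-exclusive, measured on the strand
    being scanned (the reverse complement for the "-" strand).
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the frame one codon at a time.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

A production ORF finder would also handle alternative start codons, other genetic codes, and ORFs that run off the end of a contig, none of which this sketch attempts.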
Note that the final annotation can contain genes with programmed frameshifts (ribosomal slippage), such as some transposases and PrfB genes; PGAP provides a translated CDS feature for these genes. Selenoproteins are detected as well. Other frameshifts or internal stops are annotated as pseudo. PGAP also annotates partial genes when it cannot find a start or stop for the evidence. Partial genes are translated when they abut sequence ends or gaps, and are flagged as pseudo when they fall in the middle of the sequence.
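The rule for partial genes can be sketched as a small decision function. This is an illustration under stated assumptions, not PGAP's code: it assumes a gene "abuts" only when its incomplete end coincides exactly with a sequence end or gap boundary, and the function name, parameters, and return labels are hypothetical.

```python
def classify_gene(start, end, left_complete, right_complete, seq_len, gaps=()):
    """Classify a gene as "complete", "partial" (translated), or "pseudo".

    start/end are 0-based, end-exclusive genome coordinates.
    left_complete/right_complete say whether the gene boundary (start or
    stop codon) was found at that end. gaps is an iterable of
    (gap_start, gap_end) intervals, also end-exclusive.
    """
    if left_complete and right_complete:
        return "complete"
    # Positions where sequence begins (contig start or just after a gap).
    left_edges = {0} | {gap_end for _, gap_end in gaps}
    # Positions where sequence stops (contig end or just before a gap).
    right_edges = {seq_len} | {gap_start for gap_start, _ in gaps}
    left_ok = left_complete or start in left_edges
    right_ok = right_complete or end in right_edges
    # Every incomplete end must abut an edge, otherwise flag as pseudo.
    return "partial" if left_ok and right_ok else "pseudo"


print(classify_gene(0, 300, False, True, seq_len=1000))    # abuts contig start
print(classify_gene(100, 400, False, True, seq_len=1000))  # mid-sequence
```

A real pipeline would likely allow some tolerance around the boundary rather than requiring an exact match; the source does not say what tolerance PGAP uses.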
Non-coding RNA
Structural RNAs/small ncRNAs
Structural RNAs (5S, 16S, and 23S rRNAs) are highly conserved in closely related prokaryotic species. For the 16S and 23S rRNAs, the NCBI Reference Sequence Collection (RefSeq) contains a curated set of reference sequences. The pipeline uses a BLASTn search against this reference set to identify these rRNAs. 5S rRNAs and small ncRNAs are identified using Rfam HMMs; these hits are further refined using cmsearch. Partial alignments that fall below 50% of the average length are dropped.
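The length cutoff in the last sentence can be written as a simple filter. This sketch assumes that "average length" means the average length of the RNA family the hit belongs to, which the text does not state explicitly; the `Hit` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    family: str          # e.g. "16S" (illustrative label)
    aligned_len: int     # length of the alignment on the genome
    family_avg_len: int  # assumed: average length for this RNA family


def drop_short_alignments(hits, min_fraction=0.5):
    """Keep hits whose aligned length is at least min_fraction of the average."""
    return [h for h in hits if h.aligned_len >= min_fraction * h.family_avg_len]


hits = [
    Hit("16S", 1500, 1540),  # near full length, kept
    Hit("16S", 600, 1540),   # below 770, dropped
    Hit("5S", 60, 118),      # above 59, kept
]
for hit in drop_short_alignments(hits):
    print(hit.family, hit.aligned_len)
```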