Russian AIRI Institute Develops Genatator, a Neural Network for Gene Annotation
2026-07-05 16:23

en.Wedoany.com Reported - Scientists at the AIRI Institute have developed a neural network model called Genatator that builds gene maps directly from DNA sequences and can annotate genomes for which detailed biological data are lacking. Given a DNA sequence, the model determines gene boundaries, identifies transcript types, and reconstructs gene structure, distinguishing between genes, exons, introns, and other regions.

Finding genes in DNA is challenging because genes lack universal start and stop signals: their boundaries depend on combinations of short motifs whose significance is determined by context. Genes may also overlap and may lie on either strand of DNA.
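To illustrate why a short motif alone cannot mark a gene boundary, here is a minimal Python sketch that scans both strands of a toy sequence for one motif. The function names and the toy sequence are assumptions made for illustration; this is not part of Genatator.

```python
# Illustrative sketch only: names and the toy sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_both_strands(seq: str, motif: str) -> list:
    """Return (0-based forward-strand start, strand) for every motif hit."""
    hits = []
    rc_motif = reverse_complement(motif)
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))   # motif read on the forward strand
        if window == rc_motif:
            hits.append((i, "-"))   # motif read on the reverse strand
    return hits


print(find_motif_both_strands("CCATGAAACATGG", "ATG"))
# [(1, '-'), (2, '+'), (8, '-'), (9, '+')]
```

Even this 13-nucleotide toy sequence contains four ATG matches across the two strands, which is why a motif hit by itself says little until its context is taken into account.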

Genatator works in stages. A first model searches for potential transcription start and end sites on both DNA strands, and a second model checks whether each candidate region resembles a gene. A classifier then determines the transcript type, and a segmentation model refines the gene structure, identifying exons and introns. Finally, the algorithm filters out suspicious predictions and produces the final annotation.
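The staged flow described above can be sketched roughly as follows. This is a schematic under stated assumptions, not the Genatator code: the `Candidate` class, the stage function names, the toy stand-in stages, and the `min_score` cutoff used for the final filter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A candidate gene region (hypothetical structure, for illustration)."""
    start: int
    end: int
    strand: str
    score: float
    transcript_type: Optional[str] = None
    exons: List[Tuple[int, int]] = field(default_factory=list)


def annotate(seq, propose, looks_like_gene, classify, segment, min_score=0.5):
    """Run the stages in order; each stage is passed in as a callable."""
    annotation = []
    for cand in propose(seq):                    # 1. start/end sites, both strands
        if not looks_like_gene(seq, cand):       # 2. does the region resemble a gene?
            continue
        cand.transcript_type = classify(seq, cand)   # 3. transcript type
        cand.exons = segment(seq, cand)              # 4. exon/intron structure
        if cand.score >= min_score:              # 5. drop suspicious predictions
            annotation.append(cand)
    return annotation


# Toy stand-ins for the trained models (assumptions, not real predictors).
def toy_propose(seq):
    return [Candidate(0, 12, "+", 0.9), Candidate(5, 8, "-", 0.2)]


def toy_looks_like_gene(seq, cand):
    return cand.end - cand.start >= 3


def toy_classify(seq, cand):
    return "protein_coding"


def toy_segment(seq, cand):
    return [(cand.start, cand.end)]


genes = annotate("ACGT" * 5, toy_propose, toy_looks_like_gene,
                 toy_classify, toy_segment)
print(len(genes), genes[0].transcript_type)
```

Passing the stages in as callables keeps the sketch independent of any particular model; in a real system each stand-in would be replaced by a trained network.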

This method differs from traditional tools in that the model does not rely solely on preset rules. Traditional tools rely on features of protein-coding genes, such as start codons, stop codons, and splice signals, and they perform poorly on untranslated regions and long non-coding RNAs. The new model is trained on large genomic datasets and searches for patterns directly in DNA.

This approach is particularly important for non-model organisms. Humans and mice have detailed annotations after decades of research, but most organisms have only unannotated genome assemblies. Of the 4,582 mammalian genome assemblies in the NCBI database, only 166 (about 3.6%) have annotations, and an assembly without an annotation is difficult to use for research.

The system can identify two types of genes: protein-coding genes and long non-coding RNA genes. For both types, the system determines exons and introns; for protein-coding genes, it additionally annotates the CDS region as well as the 5'-UTR and 3'-UTR regions.
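As an illustration of how these feature types relate to one another, the sketch below derives introns from exon coordinates and splits the exons of a protein-coding transcript into 5'-UTR, CDS, and 3'-UTR. The coordinate convention (0-based, half-open), the function names, and the example numbers are assumptions for illustration and do not describe Genatator's output format.

```python
# Illustrative sketch: coordinates are 0-based, half-open (start, end) pairs
# on the forward strand; all names and numbers are hypothetical.

def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons of one transcript."""
    exons = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
            if next_start > prev_end]


def split_exons(exons, cds_start, cds_end, strand="+"):
    """Split exons into UTR and CDS parts given the CDS bounds."""
    left, cds, right = [], [], []
    for start, end in sorted(exons):
        if start < cds_start:                       # part before the CDS
            left.append((start, min(end, cds_start)))
        if end > cds_start and start < cds_end:     # part inside the CDS
            cds.append((max(start, cds_start), min(end, cds_end)))
        if end > cds_end:                           # part after the CDS
            right.append((max(start, cds_end), end))
    # On the reverse strand the 5' end lies at the higher coordinates.
    five, three = (left, right) if strand == "+" else (right, left)
    return {"5'-UTR": five, "CDS": cds, "3'-UTR": three}


exons = [(100, 200), (300, 400), (550, 600)]
print(introns_from_exons(exons))
# [(200, 300), (400, 550)]
print(split_exons(exons, cds_start=150, cds_end=570))
# {"5'-UTR": [(100, 150)], 'CDS': [(150, 200), (300, 400), (550, 570)], "3'-UTR": [(570, 600)]}
```

For a long non-coding RNA gene only the exon and intron part applies, since there is no CDS to split around.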

Genatator was trained on genes from humans and 38 mammalian species, including walruses and elephants. The model also performs well on organisms that were not part of the training data, such as the fruit fly Drosophila melanogaster, the thale cress Arabidopsis thaliana, and the budding yeast Saccharomyces cerevisiae.

The model has also discovered some rare regions known as "poison exons," whose inclusion can lead to RNA degradation. Such elements rarely appear even in high-quality annotations. The developers paid particular attention to the precision of gene boundaries, as a single nucleotide error can cause a frameshift, distorting protein predictions.
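The effect of a one-nucleotide boundary error can be seen in a small worked example. The sketch below translates a made-up coding sequence with the standard genetic code, first from the correct start and then from a start shifted by one nucleotide; the sequence and function name are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate complete codons from position 0; trailing bases are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


cds = "ATGGCCAAGTTTGGATAA"      # toy coding sequence (hypothetical)
print(translate(cds))           # MAKFG*
print(translate(cds[1:]))       # WPSLD  (start placed one nucleotide too late)
```

With the start off by a single nucleotide, every downstream codon is read in the wrong frame, so the predicted protein shares nothing with the correct one and the stop codon is missed.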

Veniamin Fishman, Doctor of Biological Sciences and Chief Researcher at the AIRI Institute and the Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences (ICiG SB RAS), noted that the rate of new genome assembly exceeds the rate of annotation, and models like this can serve as a first step in analysis, enabling faster acquisition of candidate gene maps for validation.

To assess quality, the team created a public leaderboard comparing the model with other approaches, and the model performed best on multiple metrics. The training dataset was prepared by scientists from the Sirius University of Science and Technology and ICiG SB RAS.
