V. Fishman

Y. Kuratov

M. Petrov

A. Shmelev

D. Shepelin

N. Chekanov

O. Kardymon

M. Burtsev

A family of transformer-based DNA language models can interpret genomic sequences, opening new possibilities for complex biological research.


To precisely decode a genome, you need to extract contextual information from sequences that are thousands of base pairs long. Existing AI genomics tools struggle to handle such long sequences. We introduce a set of transformer-based DNA language models that can process inputs of up to 36k base pairs. They accurately infer features such as promoters, enhancers and splice sites, and match or surpass the performance of previous models.

The field of genomics has advanced substantially through the application of artificial intelligence (AI): machine learning has shown the potential to interpret genomic sequences without requiring exhaustive experimental analysis of all the intricate and interconnected molecular processes involved in DNA functioning. However, precise decoding of genomic sequences demands the comprehension of rich contextual information spread over thousands of nucleotides. At present, only a few architectures can process such extensive inputs, and they require exceptional computational resources. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths of up to 36 thousand base pairs. We offer pre-trained versions of GENA-LM and demonstrate that they can be fine-tuned to address complex biological questions with modest computational requirements. We also illustrate diverse applications of GENA-LM to downstream genomic tasks, showing performance that matches or exceeds that of prior models, whether task-specific or universal. All models are publicly accessible on GitHub at https://github.com/AIRI-Institute/GENA_LM and as pre-trained models with the gena-lm- prefix on HuggingFace at https://huggingface.co/AIRI-Institute.

Speaking DNA

GENA-LM: A family of open-source foundational models for long DNA sequences