Speaking DNA

Machine learning

GENA-LM: a family of open-source foundational DNA language models for long sequences

Nucleic Acids Research 53, 2 (2025)

V. Fishman, Y. Kuratov, A. Shmelev, M. Petrov, D. Penzar, D. Shepelin, N. Chekanov, O. Kardymon, M. Burtsev

To decode a genome precisely, you need to extract contextual information from sequences that are thousands of base pairs long. Existing AI genomics tools struggle to handle such long sequences. We introduce a set of transformer-based DNA language models that can process inputs of up to 36,000 base pairs. They accurately infer features such as promoters, enhancers and splice sites, and match or surpass the performance of previous models.
