GENA-LM: A family of open-source foundational models for long DNA sequences
Decoding a genome precisely requires extracting contextual information from sequences that are thousands of base pairs long, yet existing AI genomics tools struggle with sequences of this length. We introduce a family of transformer-based DNA language models that can process inputs of up to an unrivalled 36k base pairs. The models accurately infer genomic features such as promoters, enhancers and splice sites, and they match or surpass the performance of previous models.