Speaking DNA
Machine learning
GENA-LM: A family of open-source foundational models for long DNA sequences
arXiv (2023)
To decode a genome precisely, you need to extract contextual information from sequences that are thousands of base pairs long. Existing AI genomics tools struggle to handle such long sequences. We introduce a set of transformer-based DNA language models that can process up to an unrivalled 36k base pairs. They accurately infer features such as promoters, enhancers and splice sites, and match or surpass previous models.