Scalable and robust DNA-based storage via coding theory and deep learning

Bar-Lev, Daniella; Orr, Itai; Sabary, Omer; Etzion, Tuvi; Yaakobi, Eitan

doi:10.1038/s42256-025-01003-z

Article
Published: 21 February 2025

Nature Machine Intelligence volume 7, pages 639–649 (2025)

A preprint version of the article is available at arXiv.

Abstract

The global data sphere is expanding exponentially, projected to hit 180 zettabytes by 2025, whereas current technologies are not anticipated to scale at nearly the same rate. DNA-based storage emerges as a crucial solution to this gap, enabling digital information to be archived in DNA molecules. This method offers major advantages over magnetic and optical storage solutions, such as exceptional information density, enhanced data durability and negligible power consumption to maintain data integrity. To access the data, an information retrieval process is employed, whose main bottlenecks include scalability and accuracy, two properties that naturally trade off against each other. Here we show a modular and holistic approach that combines deep neural networks trained on simulated data, tensor product-based error-correcting codes and a safety margin mechanism into a single coherent pipeline. We demonstrated our solution on 3.1 MB of information using two different sequencing technologies. Our work improves upon the current leading solutions with a 3,200× increase in speed and a 40% improvement in accuracy and offers a code rate of 1.6 bits per base in a high-noise regime. In a broader sense, our work shows a viable path to commercial DNA storage solutions hindered by current information retrieval processes.
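
To make the abstract's figures concrete, the following minimal sketch works through the arithmetic that relates a code rate to the number of nucleotides required. It assumes that "code rate" means information bits per synthesized base (the four-letter alphabet caps this at 2 bits per base) and that 3.1 MB means 3.1 × 10^6 bytes. The function names are hypothetical, and the paper's exact accounting (for example, of indices or primers) may differ.

```python
import math

# Four nucleotides (A, C, G, T) give at most log2(4) = 2 bits per base.
MAX_BITS_PER_BASE = math.log2(4)


def bases_needed(payload_bytes: float, bits_per_base: float) -> float:
    """Nucleotides required to store payload_bytes at the given code rate."""
    return payload_bytes * 8 / bits_per_base


def redundancy_fraction(bits_per_base: float) -> float:
    """Share of the 2-bit-per-base capacity not carrying information bits."""
    return 1 - bits_per_base / MAX_BITS_PER_BASE


payload = 3.1e6  # assumption: 3.1 MB = 3.1 * 10**6 bytes
rate = 1.6       # bits per base, as quoted in the abstract

print(f"bases needed: {bases_needed(payload, rate):,.0f}")
print(f"redundancy:   {redundancy_fraction(rate):.0%}")
```

Under these assumptions, 3.1 MB at 1.6 bits per base corresponds to roughly 15.5 million nucleotides, with about 20% of the raw capacity not used for information bits.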

**Fig. 1: End-to-end solution for DNA information retrieval.**

**Fig. 2: Data used for DNA experiments.**

**Fig. 3: Comparison of the DNAformer with SOTA DNA reconstruction methods.**

**Fig. 4: Evaluation of information retrieval performance.**

Data availability

The datasets created and discussed in this Article are available via Zenodo at https://zenodo.org/records/13896773 (ref. ⁵⁰). The synthetic data generator used to create the simulated training data is available in the data generator folder of our GitHub repository at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. ⁵¹). The publicly available datasets used in this study are sourced from the following publications: Grass et al.¹³,⁵², available via Zenodo at https://zenodo.org/records/14290755 (ref. ⁵³); Erlich et al.¹⁵, available at http://www.ebi.ac.uk/ena/data/view/PRJEB19305 and http://www.ebi.ac.uk/ena/data/view/PRJEB19307; Organick et al.¹⁶, available at http://misl.cs.washington.edu/data and https://github.com/uwmisl/data-nbt17; and Srinivasavaradhan et al.⁷, available via GitHub at https://github.com/microsoft/clustered-nanopore-reads-dataset. The datasets from refs. ⁷,¹³,¹⁵, where the reads are classified into clusters by using our binning algorithm, are available via Zenodo at https://zenodo.org/records/14296588 (ref. ⁵⁴). Binning of the dataset from ref. ¹⁶ is possible by applying the preprocessing and binning scripts (reads_preprocessor.py and binning.py) available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. ⁵¹). A parametric comparison of previous DNA data storage experiments⁵⁵,⁵⁶,⁵⁷, with additional parameters related to our data, is provided in Supplementary Table 4.

Code availability

The code used for this work is available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. ⁵¹).

References

1. Rydning, D. R. J. G. J., Reinsel, J. & Gantz, J. The Digitization of the World from Edge to Core (International Data Corporation, 2018).
2. Meiser, L. C. et al. Synthetic DNA applications in information technology. Nat. Commun. 13, 352 (2022).
3. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
4. LeProust, E. M. et al. Synthesis of high-quality libraries of long (150 mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010).
5. Heckel, R., Mikutis, G. & Grass, R. N. A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).
6. Sabary, O., Yucovich, A., Shapira, G. & Yaakobi, E. Reconstruction algorithms for DNA storage systems. Sci. Rep. 14, 951 (2024).
7. Srinivasavaradhan, S. R., Gopi, S., Pfister, H. D. & Yekhanin, S. Trellis BMA: coded trace reconstruction on IDS channels for DNA storage. In 2021 IEEE International Symposium on Information Theory (ISIT) (ed. Dey, B.) 2453–2458 (IEEE, 2021); https://doi.org/10.1109/ISIT45174.2021.9517821
8. Lenz, A. et al. Concatenated codes for recovery from multiple reads of DNA sequences. In 2020 IEEE Information Theory Workshop (ITW) (ed. Dalai, M.) 1–5 (IEEE, 2021).
9. Levenshtein, V. I. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory. Ser. A 93, 310–332 (2001).
10. McGregor, A., Price, E. & Vorotnikova, S. Trace reconstruction revisited. In European Symposium on Algorithms (eds Schulz, A. S. & Wagner, D.) 689–700 (Springer, 2014).
11. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628–1628 (2012).
12. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
13. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).
14. MacWilliams, F. J. & Sloane, N. J. A. The Theory of Error-Correcting Codes Vol. 16 (Elsevier, 1997).
15. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
16. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
17. Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
18. Wang, Y. et al. High capacity DNA data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping. J. Biol. Eng. 13, 1–11 (2019).
19. Chandak, S. et al. Overcoming high Nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (ed. Pérez-Neira, A. I.) 8822–8826 (IEEE, 2020).
20. Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1237 (2019).
21. Cheraghchi, M. & Ribeiro, J. An overview of capacity results for synchronization channels. IEEE Trans. Inf. Theory 67, 3207–3232 (2020).
22. Qu, G., Yan, Z. & Wu, H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief. Bioinform. 23, bbac336 (2022).
23. Rashtchian, C. et al. Clustering billions of reads for DNA data storage. Adv. Neural Inf. Process. Syst. 30, 3362–3373 (2017).
24. Viswanathan, K. & Swaminathan, R. Improved string reconstruction over insertion–deletion channels. In Proc. 19th Annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 399–408 (Society for Industrial and Applied Mathematics, 2008).
25. Batu, T., Kannan, S., Khanna, S. & McGregor, A. Reconstructing strings from random traces. SODA 4, 910–918 (2004).
26. Holden, N., Pemantle, R. & Peres, Y. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference on Learning Theory Vol. 75 (eds Bubeck, S. et al.) 1799–1840 (PMLR, 2018).
27. Holenstein, T., Mitzenmacher, M., Panigrahy, R. & Wieder, U. Trace reconstruction with constant deletion probability and related results. In Proc. 19th Annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 389–398 (Society for Industrial and Applied Mathematics, 2008).
28. Nazarov, F. & Peres, Y. Trace reconstruction with exp(O(n^{1/3})) samples. In Proc. 49th Annual ACM SIGACT Symposium on Theory of Computing (eds Hatami, H. & McKenzie, P.) 1042–1046 (ACM, 2017).
29. Peres, Y. & Zhai, A. Average-case reconstruction for the deletion channel: subpolynomially many traces suffice. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (ed. Umans, C.) 228–239 (IEEE Computer Society, 2017).
30. Bee, C. et al. Content-based similarity search in large-scale DNA data storage systems. Nat. Commun. 12, 4764 (2021).
31. Pan, C. et al. Rewritable two-dimensional DNA-based data storage with machine learning reconstruction. Nat. Commun. 13, 2984 (2022).
32. Wolf, J. On codes derivable from the tensor product of check matrices. IEEE Trans. Inf. Theory 11, 281–284 (1965).
33. Sabary, O. et al. SOLQC: synthetic oligo library quality control tool. Bioinformatics 37, 720–722 (2021).
34. Preserving Our Digital Legacy: an Introduction to DNA Data Storage (DNA Data Storage Alliance, 2021).
35. Gopalan, P. S. et al. Trace reconstruction from noisy polynucleotide sequencer reads. US Patent Application 15/536,115 (2018).
36. Marcus, B. H., Roth, R. M. & Siegel, P. H. Constrained systems and coding for recording channels. In An Introduction to Coding for Constrained Systems (eds Pless, V. & Huffman, W. C.) (Elsevier, 1998).
37. Gimpel, A. L., Stark, W. J., Heckel, R. & Grass, R. N. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat. Commun. 14, 6026 (2023).
38. Bohlin, J., Rose, B. & Pettersson, J. H. O. Estimation of AT and GC content distributions of nucleotide substitution rates in bacterial core genomes. Big Data Anal. 4, 1–11 (2019).
39. Weindel, F., Gimpel, A. L., Grass, R. N. & Heckel, R. Embracing errors is more effective than avoiding them through constrained coding for DNA data storage. In 2023 59th Annual Allerton Conference on Communication, Control, and Computing 1–8 (IEEE, 2023).
40. Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom. Bioinform. 3, lqab019 (2021).
41. Ping, Z. et al. Towards practical and robust DNA-based data archiving using the yin-yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).
42. Chaykin, G., Furman, N., Sabary, O., Ben-Shabat, D. & Yaakobi, E. DNA-storalator: end-to-end DNA storage simulator. In 13th Annual Non-volatile Memories Workshop (2022).
43. Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural Inf. Process. Syst. 34, 30392–30400 (2021).
44. Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).
45. Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (IEEE, 2017).
46. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
47. Menin, J. F. & Nichols, N. M. Multiplex PCR using Q5 High-Fidelity DNA Polymerase (New England Biolabs, 2013).
48. Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620 (2014).
49. Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021).
50. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Datasets of scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.13896773 (2024).
51. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Code repository for scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.14266018 (2024).
52. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290754 (2015).
53. Grass, R. N. et al. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290755 (2015).
54. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Prepared binned DNA data storage datasets for reconstruction benchmarking. Zenodo https://doi.org/10.5281/zenodo.14296588 (2024).
55. Bornholt, J. et al. A DNA-based archival storage system. In Proc. 21st International Conference on Architectural Support for Programming Languages and Operating Systems 637–649 (ACM, 2016).
56. Blawat, M. et al. Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016).
57. Wagner, R. A. & Fischer, M. J. The string-to-string correction problem. J. ACM 21, 168–173 (1974).

Acknowledgements

D.B.-L., I.O., O.S. and E.Y. were funded by the European Union (ERC, DNAStorage, 101045114 and EIC, DiDAX 101115134). The views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. D.B.-L. and T.E. were supported in part by ISF grant no. 222/19. The authors thank S. Goldberg, N. Kikuchi and R. Amit for assistance with the wet experiments and for providing the lab and equipment; N. Fourier and L. Linde for their help in the sequencing processes; and D. B. Shabat for his assistance and advice. Finally, we thank A. Yucovich for his help with developing the CPL algorithm.

Author information

These authors contributed equally: Daniella Bar-Lev, Itai Orr, Omer Sabary.

Authors and Affiliations

Computer Science Faculty, Technion–Israel Institute of Technology, Haifa, Israel
Daniella Bar-Lev, Itai Orr, Omer Sabary, Tuvi Etzion & Eitan Yaakobi
UVeye Ltd., Tel Aviv, Israel
Itai Orr

Contributions

Conceptualization: D.B.-L., I.O. and O.S. Methodology: D.B.-L., I.O. and O.S. Investigation: D.B.-L., I.O., O.S., T.E. and E.Y. Supervision: T.E. and E.Y. Writing, review and editing: D.B.-L., I.O., O.S., T.E. and E.Y.

Corresponding authors

Correspondence to Daniella Bar-Lev, Itai Orr or Omer Sabary.

Ethics declarations

Competing interests

D.B.-L., I.O., O.S., T.E. and E.Y. are inventors on patent application US 18/233,855 submitted by the Technion–Israel Institute of Technology and Bar Ilan University.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Supplementary Results, Figs. 1–13 and Tables 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bar-Lev, D., Orr, I., Sabary, O. et al. Scalable and robust DNA-based storage via coding theory and deep learning. Nat Mach Intell 7, 639–649 (2025). https://doi.org/10.1038/s42256-025-01003-z

Received: 12 June 2024
Accepted: 23 January 2025
Published: 21 February 2025
Issue Date: April 2025
DOI: https://doi.org/10.1038/s42256-025-01003-z