Scalable and robust DNA-based storage via coding theory and deep learning

A preprint version of the article is available at arXiv.

Abstract

The global data sphere is expanding exponentially and is projected to reach 180 zettabytes by 2025, whereas current storage technologies are not expected to scale at nearly the same rate. DNA-based storage is emerging as a crucial solution to this gap, enabling digital information to be archived in DNA molecules. It offers major advantages over magnetic and optical storage, including exceptional information density, enhanced data durability and negligible power consumption for maintaining data integrity. Accessing the data requires an information retrieval process whose main bottlenecks are scalability and accuracy, which naturally trade off against each other. Here we show a modular and holistic approach that combines deep neural networks trained on simulated data, tensor product-based error-correcting codes and a safety margin mechanism into a single coherent pipeline. We demonstrate our solution on 3.1 MB of information using two different sequencing technologies. Our work improves upon the current leading solutions with a 3,200× increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime. More broadly, our work shows a viable path to commercial DNA storage solutions, which are currently hindered by information retrieval processes.
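To make two quantities in the abstract concrete, the following Python sketch (not taken from the authors' repository) works through the arithmetic behind a code rate in bits per base and shows one simple way simulated noisy reads could be generated. It assumes that 1 MB means 10^6 bytes, that the code rate counts information bits per synthesized base, and that noise is an independent insertion, deletion and substitution process per base. The function names and default error probabilities are hypothetical; the authors' actual simulator is the synthetic data generator referenced under Data availability.

```python
import math
import random
from fractions import Fraction

ALPHABET = "ACGT"


def bases_needed(payload_bytes: int, bits_per_base: str) -> int:
    """Nucleotides required to carry a payload at a given code rate.

    Assumption: the code rate is information bits per synthesized base.
    The rate is passed as a string so that it is parsed exactly.
    """
    rate = Fraction(bits_per_base)
    return math.ceil(Fraction(payload_bytes * 8) / rate)


def ids_channel(strand: str, p_sub: float, p_del: float, p_ins: float,
                rng: random.Random) -> str:
    """Pass one strand through a simple i.i.d. insertion/deletion/substitution
    channel. This is an illustrative noise model, not the authors' simulator."""
    out = []
    for base in strand:
        # Possibly insert a random base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # the base is deleted
        if r < p_del + p_sub:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def noisy_cluster(strand: str, copies: int, rng: random.Random,
                  p_sub: float = 0.01, p_del: float = 0.01,
                  p_ins: float = 0.01) -> list[str]:
    """Generate a cluster of noisy reads of the same designed strand."""
    return [ids_channel(strand, p_sub, p_del, p_ins, rng) for _ in range(copies)]
```

Under these assumptions, `bases_needed(3_100_000, "1.6")` gives 15,500,000 bases for a 3.1 MB payload. A reconstruction model would take a cluster such as `noisy_cluster(strand, 10, random.Random(0))` as input and try to recover `strand`.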

Fig. 1: End-to-end solution for DNA information retrieval.
Fig. 2: Data used for DNA experiments.
Fig. 3: Comparison of the DNAformer with SOTA DNA reconstruction methods.
Fig. 4: Evaluation of information retrieval performance.

Data availability

The datasets created and discussed in this Article are available via Zenodo at https://zenodo.org/records/13896773 (ref. 50). The synthetic data generator used to create the simulated training data is available in the data generator folder of our GitHub repository at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51). The publicly available datasets used in this study are sourced from the following publications: Grass et al. (refs. 13,52), available via Zenodo at https://zenodo.org/records/14290755 (ref. 53); Erlich et al. (ref. 15), available at http://www.ebi.ac.uk/ena/data/view/PRJEB19305 and http://www.ebi.ac.uk/ena/data/view/PRJEB19307; Organick et al. (ref. 16), available at http://misl.cs.washington.edu/data and https://github.com/uwmisl/data-nbt17; and Srinivasavaradhan et al. (ref. 7), available via GitHub at https://github.com/microsoft/clustered-nanopore-reads-dataset. The datasets from refs. 7,13,15, where the reads are classified into clusters by using our binning algorithm, are available via Zenodo at https://zenodo.org/records/14296588 (ref. 54). Binning of the dataset from ref. 16 is possible by applying the preprocessing and binning scripts (reads_preprocessor.py and binning.py) available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51). A parametric comparison of previous DNA data storage experiments (refs. 55,56,57), with additional parameters related to our data, is provided in Supplementary Table 4.

Code availability

The code used for this work is available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51).

References

  1. Reinsel, D., Gantz, J. & Rydning, J. The Digitization of the World from Edge to Core (International Data Corporation, 2018).

  2. Meiser, L. C. et al. Synthetic DNA applications in information technology. Nat. Commun. 13, 352 (2022).

  3. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).

  4. LeProust, E. M. et al. Synthesis of high-quality libraries of long (150 mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010).

  5. Heckel, R., Mikutis, G. & Grass, R. N. A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).

  6. Sabary, O., Yucovich, A., Shapira, G. & Yaakobi, E. Reconstruction algorithms for DNA storage systems. Sci. Rep. 14, 951 (2024).

  7. Srinivasavaradhan, S. R., Gopi, S., Pfister, H. D. & Yekhanin, S. Trellis BMA: coded trace reconstruction on IDS channels for DNA storage. In 2021 IEEE International Symposium on Information Theory (ISIT) (ed. Dey, B.) 2453–2458 (IEEE, 2021); https://doi.org/10.1109/ISIT45174.2021.9517821

  8. Lenz, A. et al. Concatenated codes for recovery from multiple reads of DNA sequences. In 2020 IEEE Information Theory Workshop (ITW) (ed. Dalai, M.) 1–5 (IEEE, 2021).

  9. Levenshtein, V. I. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory. Ser. A 93, 310–332 (2001).

  10. McGregor, A., Price, E. & Vorotnikova, S. Trace reconstruction revisited. In European Symposium on Algorithms (eds Schulz, A. S. & Wagner, D.) 689–700 (Springer, 2014).

  11. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).

  12. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

  13. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).

  14. MacWilliams, F. J. & Sloane, N. J. A. The Theory of Error-Correcting Codes Vol. 16 (Elsevier, 1997).

  15. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).

  16. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).

  17. Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).

  18. Wang, Y. et al. High capacity DNA data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping. J. Biol. Eng. 13, 1–11 (2019).

  19. Chandak, S. et al. Overcoming high Nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (ed. Pérez-Neira, A. I.) 8822–8826 (IEEE, 2020).

  20. Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1237 (2019).

  21. Cheraghchi, M. & Ribeiro, J. An overview of capacity results for synchronization channels. IEEE Trans. Inf. Theory 67, 3207–3232 (2020).

  22. Qu, G., Yan, Z. & Wu, H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief. Bioinform. 23, bbac336 (2022).

  23. Rashtchian, C. et al. Clustering billions of reads for DNA data storage. Adv. Neural Inf. Process. Syst. 30, 3362–3373 (2017).

  24. Viswanathan, K. & Swaminathan, R. Improved string reconstruction over insertion–deletion channels. In Proc. 19th annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 399–408 (Society for Industrial and Applied Mathematics, 2008).

  25. Batu, T., Kannan, S., Khanna, S. & McGregor, A. Reconstructing strings from random traces. SODA 4, 910–918 (2004).

  26. Holden, N., Pemantle, R. & Peres, Y. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference on Learning Theory Vol. 75 (eds Bubeck, S. et al.) 1799–1840 (PMLR, 2018).

  27. Holenstein, T., Mitzenmacher, M., Panigrahy, R. & Wieder, U. Trace reconstruction with constant deletion probability and related results. In Proc. 19th Annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 389–398 (Society for Industrial and Applied Mathematics, 2008).

  28. Nazarov, F. & Peres, Y. Trace reconstruction with exp(O(n^{1/3})) samples. In Proc. 49th Annual ACM SIGACT Symposium on Theory of Computing (eds Hatami, H. & McKenzie, P.) 1042–1046 (ACM, 2017).

  29. Peres, Y. & Zhai, A. Average-case reconstruction for the deletion channel: subpolynomially many traces suffice. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (ed. Umans, C.) 228–239 (IEEE Computer Society, 2017).

  30. Bee, C. et al. Content-based similarity search in large-scale DNA data storage systems. Nat. Commun. 12, 4764 (2021).

  31. Pan, C. et al. Rewritable two-dimensional DNA-based data storage with machine learning reconstruction. Nat. Commun. 13, 2984 (2022).

  32. Wolf, J. On codes derivable from the tensor product of check matrices. IEEE Trans. Inf. Theory 11, 281–284 (1965).

  33. Sabary, O. et al. SOLQC: synthetic oligo library quality control tool. Bioinformatics 37, 720–722 (2021).

  34. Preserving Our Digital Legacy: an Introduction to DNA Data Storage (DNA Data Storage Alliance, 2021).

  35. Gopalan, P. S. et al. Trace reconstruction from noisy polynucleotide sequencer reads. US Patent Application 15/536,115 (2018).

  36. Marcus, B. H., Roth, R. M. & Siegel, P. H. Constrained systems and coding for recording channels. In An Introduction to Coding for Constrained Systems (eds Pless, V. & Huffman, W. C.) (Elsevier, 1998).

  37. Gimpel, A. L., Stark, W. J., Heckel, R. & Grass, R. N. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat. Commun. 14, 6026 (2023).

  38. Bohlin, J., Rose, B. & Pettersson, J. H. O. Estimation of AT and GC content distributions of nucleotide substitution rates in bacterial core genomes. Big Data Anal. 4, 1–11 (2019).

  39. Weindel, F., Gimpel, A. L., Grass, R. N. & Heckel, R. Embracing errors is more effective than avoiding them through constrained coding for DNA data storage. In 2023 59th Annual Allerton Conference on Communication, Control, and Computing 1–8 (IEEE, 2023).

  40. Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom. Bioinform. 3, lqab019 (2021).

  41. Ping, Z. et al. Towards practical and robust DNA-based data archiving using the yin-yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).

  42. Chaykin, G., Furman, N., Sabary, O., Ben-Shabat, D. & Yaakobi, E. DNA-storalator: end-to-end DNA storage simulator. In 13th Annual Non-volatile Memories Workshop (2022).

  43. Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural Inf. Process. Syst. 34, 30392–30400 (2021).

  44. Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).

  45. Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (IEEE, 2017).

  46. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).

  47. Menin, J. F. & Nichols, N. M. Multiplex PCR using Q5 High-Fidelity DNA Polymerase (New England Biolabs, 2013).

  48. Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620 (2014).

  49. Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021).

  50. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Datasets of scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.13896773 (2024).

  51. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Code repository for scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.14266018 (2024).

  52. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290754 (2015).

  53. Grass, R. N. et al. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290755 (2015).

  54. Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Prepared binned DNA data storage datasets for reconstruction benchmarking. Zenodo https://doi.org/10.5281/zenodo.14296588 (2024).

  55. Bornholt, J. et al. A DNA-based archival storage system. In Proc. 21st International Conference on Architectural Support for Programming Languages and Operating Systems 637–649 (ACM, 2016).

  56. Blawat, M. et al. Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016).

  57. Wagner, R. A. & Fischer, M. J. The string-to-string correction problem. J. ACM 21, 168–173 (1974).

Acknowledgements

D.B.-L., I.O., O.S. and E.Y. were funded by the European Union (ERC, DNAStorage, 101045114 and EIC, DiDAX 101115134). The views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. D.B.-L. and T.E. were supported in part by ISF grant no. 222/19. The authors thank S. Goldberg, N. Kikuchi and R. Amit for assistance with the wet experiments and for providing the lab and equipment; N. Fourier and L. Linde for their help with the sequencing processes; and D. B. Shabat for his assistance and advice. Finally, we thank A. Yucovich for his help with developing the CPL algorithm.

Author information

Contributions

Conceptualization: D.B.-L., I.O. and O.S. Methodology: D.B.-L., I.O. and O.S. Investigation: D.B.-L., I.O., O.S., T.E. and E.Y. Supervision: T.E. and E.Y. Writing, review and editing: D.B.-L., I.O., O.S., T.E. and E.Y.

Corresponding authors

Correspondence to Daniella Bar-Lev, Itai Orr or Omer Sabary.

Ethics declarations

Competing interests

D.B.-L., I.O., O.S., T.E. and E.Y. are inventors on patent application US 18/233,855 submitted by the Technion–Israel Institute of Technology and Bar Ilan University.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Supplementary Results, Figs. 1–13 and Tables 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bar-Lev, D., Orr, I., Sabary, O. et al. Scalable and robust DNA-based storage via coding theory and deep learning. Nat Mach Intell 7, 639–649 (2025). https://doi.org/10.1038/s42256-025-01003-z

  • DOI: https://doi.org/10.1038/s42256-025-01003-z
