Abstract
The global data sphere is expanding exponentially and is projected to reach 180 zettabytes by 2025, whereas current storage technologies are not anticipated to scale at nearly the same rate. DNA-based storage emerges as a crucial solution to this gap, enabling digital information to be archived in DNA molecules. Compared with magnetic and optical storage, this method offers major advantages, including exceptional information density, enhanced data durability and negligible power consumption to maintain data integrity. To access the data, an information retrieval process is employed, whose main bottlenecks are scalability and accuracy, two properties that naturally trade off against each other. Here we show a modular and holistic approach that combines deep neural networks trained on simulated data, tensor-product-based error-correcting codes and a safety margin mechanism into a single coherent pipeline. We demonstrate our solution on 3.1 MB of information using two different sequencing technologies. Our work improves upon the current leading solutions with a 3,200× increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime. In a broader sense, our work shows a viable path to commercial DNA storage solutions, which are currently hindered by existing information retrieval processes.
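To put the quoted code rate in perspective, the following is a minimal back-of-the-envelope sketch rather than anything taken from the authors' pipeline. It assumes a four-letter alphabet (A, C, G, T), takes 1 MB to mean 10^6 bytes, and treats the 1.6 bits per base as applying uniformly to the whole 3.1 MB payload. The abstract does not state how primers or indices are counted, so the resulting base count is only indicative. The function names are hypothetical.

```python
import math
from fractions import Fraction

BITS_PER_BYTE = 8
ALPHABET_SIZE = 4  # assumption: the four standard nucleotides A, C, G, T


def max_bits_per_base(alphabet_size: int = ALPHABET_SIZE) -> float:
    """Upper bound on information per base for an error-free channel."""
    return math.log2(alphabet_size)


def bases_needed(payload_bytes: int, bits_per_base: Fraction) -> int:
    """Number of bases required to carry the payload at a given code rate."""
    payload_bits = payload_bytes * BITS_PER_BYTE
    # Fraction arithmetic keeps the division exact before rounding up.
    return math.ceil(Fraction(payload_bits) / bits_per_base)


# Figures quoted in the abstract; the byte count assumes 1 MB = 10**6 bytes.
payload_bytes = 3_100_000
code_rate = Fraction(16, 10)  # 1.6 bits per base

efficiency = float(code_rate) / max_bits_per_base()
total_bases = bases_needed(payload_bytes, code_rate)

print(f"fraction of the 2 bits/base ceiling: {efficiency:.0%}")
print(f"bases needed for the payload: {total_bases:,}")
```

Under these assumptions the rate corresponds to 80% of the theoretical ceiling of 2 bits per base, with the remaining 20% available as redundancy for error correction and other overhead.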
Data availability
The datasets created and discussed in this Article are available via Zenodo at https://zenodo.org/records/13896773 (ref. 50). The synthetic data generator used to create the simulated training data is available in the data generator folder of our GitHub repository at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51). The publicly available datasets used in this study are sourced from the following publications: Grass et al. (refs. 13,52), available via Zenodo at https://zenodo.org/records/14290755 (ref. 53); Erlich et al. (ref. 15), available at http://www.ebi.ac.uk/ena/data/view/PRJEB19305 and http://www.ebi.ac.uk/ena/data/view/PRJEB19307; Organick et al. (ref. 16), available at http://misl.cs.washington.edu/data and https://github.com/uwmisl/data-nbt17; and Srinivasavaradhan et al. (ref. 7), available via GitHub at https://github.com/microsoft/clustered-nanopore-reads-dataset. The datasets from refs. 7,13,15, with the reads classified into clusters by our binning algorithm, are available via Zenodo at https://zenodo.org/records/14296588 (ref. 54). The dataset from ref. 16 can be binned by applying the preprocessing and binning scripts (reads_preprocessor.py and binning.py) available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51). A parametric comparison of previous DNA data storage experiments (refs. 55–57), with additional parameters related to our data, is provided in Supplementary Table 4.
Code availability
The code used for this work is available via GitHub at https://github.com/itaiorr/Deep-DNA-based-storage.git (ref. 51).
References
Rydning, D. R. J. G. J., Reinsel, J. & Gantz, J. The Digitization of the World from Edge to Core (International Data Corporation, 2018).
Meiser, L. C. et al. Synthetic DNA applications in information technology. Nat. Commun. 13, 352 (2022).
Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
LeProust, E. M. et al. Synthesis of high-quality libraries of long (150 mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010).
Heckel, R., Mikutis, G. & Grass, R. N. A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).
Sabary, O., Yucovich, A., Shapira, G. & Yaakobi, E. Reconstruction algorithms for DNA storage systems. Sci. Rep. 14, 951 (2024).
Srinivasavaradhan, S. R., Gopi, S., Pfister, H. D. & Yekhanin, S. Trellis BMA: coded trace reconstruction on IDS channels for DNA storage. In 2021 IEEE International Symposium on Information Theory (ISIT) (ed. Dey, B.) 2453–2458 (IEEE, 2021); https://doi.org/10.1109/ISIT45174.2021.9517821
Lenz, A. et al. Concatenated codes for recovery from multiple reads of DNA sequences. In 2020 IEEE Information Theory Workshop (ITW) (ed. Dalai, M.) 1–5 (IEEE, 2021).
Levenshtein, V. I. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory. Ser. A 93, 310–332 (2001).
McGregor, A., Price, E. & Vorotnikova, S. Trace reconstruction revisited. In European Symposium on Algorithms (eds Schulz, A. S. & Wagner, D.) 689–700 (Springer, 2014).
Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).
MacWilliams, F. J. & Sloane, N. J. A. The Theory of Error-Correcting Codes Vol. 16 (Elsevier, 1997).
Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
Wang, Y. et al. High capacity DNA data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping. J. Biol. Eng. 13, 1–11 (2019).
Chandak, S. et al. Overcoming high Nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (ed. Pérez-Neira, A. I.) 8822–8826 (IEEE, 2020).
Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1237 (2019).
Cheraghchi, M. & Ribeiro, J. An overview of capacity results for synchronization channels. IEEE Trans. Inf. Theory 67, 3207–3232 (2020).
Qu, G., Yan, Z. & Wu, H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief. Bioinform. 23, bbac336 (2022).
Rashtchian, C. et al. Clustering billions of reads for DNA data storage. Adv. Neural Inf. Process. Syst. 30, 3362–3373 (2017).
Viswanathan, K. & Swaminathan, R. Improved string reconstruction over insertion–deletion channels. In Proc. 19th annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 399–408 (Society for Industrial and Applied Mathematics, 2008).
Batu, T., Kannan, S., Khanna, S. & McGregor, A. Reconstructing strings from random traces. SODA 4, 910–918 (2004).
Holden, N., Pemantle, R. & Peres, Y. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference on Learning Theory Vol. 75 (eds Bubeck, S. et al.) 1799–1840 (PMLR, 2018).
Holenstein, T., Mitzenmacher, M., Panigrahy, R. & Wieder, U. Trace reconstruction with constant deletion probability and related results. In Proc. 19th Annual ACM–SIAM Symposium on Discrete Algorithms (ed. Teng, S.-H.) 389–398 (Society for Industrial and Applied Mathematics, 2008).
Nazarov, F. & Peres, Y. Trace reconstruction with exp(O(n^{1/3})) samples. In Proc. 49th Annual ACM SIGACT Symposium on Theory of Computing (eds Hatami, H. & McKenzie, P.) 1042–1046 (ACM, 2017).
Peres, Y. & Zhai, A. Average-case reconstruction for the deletion channel: subpolynomially many traces suffice. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (ed. Umans, C.) 228–239 (IEEE Computer Society, 2017).
Bee, C. et al. Content-based similarity search in large-scale DNA data storage systems. Nat. Commun. 12, 4764 (2021).
Pan, C. et al. Rewritable two-dimensional DNA-based data storage with machine learning reconstruction. Nat. Commun. 13, 2984 (2022).
Wolf, J. On codes derivable from the tensor product of check matrices. IEEE Trans. Inf. Theory 11, 281–284 (1965).
Sabary, O. et al. SOLQC: synthetic oligo library quality control tool. Bioinformatics 37, 720–722 (2021).
Preserving Our Digital Legacy: an Introduction to DNA Data Storage (DNA Data Storage Alliance, 2021).
Gopalan, P. S. et al. Trace reconstruction from noisy polynucleotide sequencer reads. US Patent Application 15/536,115 (2018).
Marcus, B. H., Roth, R. M. & Siegel, P. H. Constrained systems and coding for recording channels. In An Introduction to Coding for Constrained Systems (eds Pless, V. & Huffman, W. C.) (Elsevier, 1998).
Gimpel, A. L., Stark, W. J., Heckel, R. & Grass, R. N. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat. Commun. 14, 6026 (2023).
Bohlin, J., Rose, B. & Pettersson, J. H. O. Estimation of AT and GC content distributions of nucleotide substitution rates in bacterial core genomes. Big Data Anal. 4, 1–11 (2019).
Weindel, F., Gimpel, A. L., Grass, R. N. & Heckel, R. Embracing errors is more effective than avoiding them through constrained coding for DNA data storage. In 2023 59th Annual Allerton Conference on Communication, Control, and Computing 1–8 (IEEE, 2023).
Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom. Bioinform. 3, lqab019 (2021).
Ping, Z. et al. Towards practical and robust DNA-based data archiving using the yin-yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).
Chaykin, G., Furman, N., Sabary, O., Ben-Shabat, D. & Yaakobi, E. DNA-storalator: end-to-end DNA storage simulator. In 13th Annual Non-volatile Memories Workshop (2022).
Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural Inf. Process. Syst. 34, 30392–30400 (2021).
Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).
Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (IEEE, 2017).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Menin, J. F. & Nichols, N. M. Multiplex PCR using Q5 High-Fidelity DNA Polymerase (New England Biolabs, 2013).
Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620 (2014).
Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021).
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Datasets of scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.13896773 (2024).
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Code repository for scalable and robust DNA-based storage via coding theory and deep learning. Zenodo https://doi.org/10.5281/zenodo.14266018 (2024).
Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290754 (2015).
Grass, R. N. et al. Dataset for “Robust chemical preservation of digital information on DNA in silica with error-correcting codes”. Zenodo https://doi.org/10.5281/zenodo.14290755 (2015).
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Prepared binned DNA data storage datasets for reconstruction benchmarking. Zenodo https://doi.org/10.5281/zenodo.14296588 (2024).
Bornholt, J. et al. A DNA-based archival storage system. In Proc. 21st International Conference on Architectural Support for Programming Languages and Operating Systems 637–649 (ACM, 2016).
Blawat, M. et al. Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016).
Wagner, R. A. & Fischer, M. J. The string-to-string correction problem. J. ACM 21, 168–173 (1974).
Acknowledgements
D.B.-L., I.O., O.S. and E.Y. were funded by the European Union (ERC, DNAStorage, 101045114 and EIC, DiDAX 101115134). The views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. D.B.-L. and T.E. were supported in part by ISF grant no. 222/19. The authors thank S. Goldberg, N. Kikuchi and R. Amit for assistance with the wet experiments and for providing the lab and equipment; N. Fourier and L. Linde for their help with the sequencing processes; and D. B. Shabat for his assistance and advice. Finally, we thank A. Yucovich for his help with developing the CPL algorithm.
Author information
Contributions
Conceptualization: D.B.-L., I.O. and O.S. Methodology: D.B.-L., I.O. and O.S. Investigation: D.B.-L., I.O., O.S., T.E. and E.Y. Supervision: T.E. and E.Y. Writing, review and editing: D.B.-L., I.O., O.S., T.E. and E.Y.
Ethics declarations
Competing interests
D.B.-L., I.O., O.S., T.E. and E.Y. are inventors on patent application US 18/233,855 submitted by the Technion–Israel Institute of Technology and Bar Ilan University.
Peer review
Peer review information
Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Material
Supplementary Results, Figs. 1–13 and Tables 1–10.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bar-Lev, D., Orr, I., Sabary, O. et al. Scalable and robust DNA-based storage via coding theory and deep learning. Nat Mach Intell 7, 639–649 (2025). https://doi.org/10.1038/s42256-025-01003-z
DOI: https://doi.org/10.1038/s42256-025-01003-z