|Data Deduplication with Random Substitutions
|Hao Lou, Farzad Farnoud, University of Virginia, United States
|O.1: Data Compression
|Click here to download the manuscript
|Click here to watch in the Virtual Symposium
|Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide information-theoretical analysis on the performance of deduplication algorithms with data streams where repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of fixed-length deduplication are studied and proven to have performance within a constant factor of optimal with the knowledge of repeat length. We also study the variable-length scheme and show that as entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.