RMIT University

Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections

conference contribution
posted on 2024-10-31, 20:17 authored by Christopher Hoobin, Simon Puglisi, Justin Zobel
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of a whole block. The choice of block size is critical: it trades off compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods and in general gives much better compression.
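The idea of encoding documents relative to a sampled dictionary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the evenly spaced sampling policy, the `(position, length)` factor format with a zero-length literal fallback, and the naive substring search are all assumptions made for illustration. A practical system would use an index over the dictionary (for example a suffix array) and a compact factor encoding.

```python
def build_dictionary(collection: bytes, dict_size: int, sample_len: int) -> bytes:
    """Concatenate evenly spaced samples of the collection.

    The sampling policy here is an assumption chosen for illustration.
    """
    if dict_size >= len(collection):
        return collection
    n_samples = max(1, dict_size // sample_len)
    stride = len(collection) // n_samples
    return b"".join(
        collection[i * stride : i * stride + sample_len] for i in range(n_samples)
    )


def longest_match(dictionary: bytes, text: bytes, start: int) -> tuple[int, int]:
    """Return (position, length) of the longest prefix of text[start:] in dictionary."""
    best_pos, best_len = 0, 0
    length = 1
    # If a prefix of length k occurs, every shorter prefix occurs too,
    # so we can stop at the first length that fails to match.
    while start + length <= len(text):
        pos = dictionary.find(text[start : start + length])
        if pos < 0:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def factorize(dictionary: bytes, document: bytes) -> list[tuple[int, int]]:
    """Greedily encode a document as (position, length) pairs into the dictionary.

    A byte absent from the dictionary is emitted as (byte_value, 0).
    """
    factors = []
    i = 0
    while i < len(document):
        pos, length = longest_match(dictionary, document, i)
        if length == 0:
            factors.append((document[i], 0))  # literal byte
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(dictionary: bytes, factors: list[tuple[int, int]]) -> bytes:
    """Rebuild one document using only its factors and the shared dictionary."""
    out = bytearray()
    for value, length in factors:
        if length == 0:
            out.append(value)
        else:
            out += dictionary[value : value + length]
    return bytes(out)


if __name__ == "__main__":
    docs = [b"the quick brown fox", b"the lazy brown dog", b"quick quick fox"]
    dictionary = build_dictionary(b"".join(docs), dict_size=16, sample_len=4)
    encoded = [factorize(dictionary, d) for d in docs]
    # Random access: document 1 is decoded without touching documents 0 or 2.
    print(decode(dictionary, encoded[1]))
```

Because each document is encoded only against the fixed dictionary, decoding one document never requires decompressing its neighbours, which is the property that distinguishes this scheme from the block-based approach described above.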

History

Volume

5

Start page

265

End page

273

Total pages

9

Outlet

Proceedings of the 38th International Conference on Very Large Data Bases

Editors

Ahmet Saçan and Nesime Tatbul

Name of conference

VLDB Endowment

Publisher

Very Large Data Bases

Place published

Turkey

Start date

2012-08-27

End date

2012-08-31

Language

English

Copyright

© 2011 VLDB Endowment

Former Identifier

2006062932

Esploro creation date

2020-06-22

Fedora creation date

2016-07-14
