Citation Link: https://doi.org/10.25819/ubsi/610
Grammar-based compression for strings and trees
Alternate Title
Grammatik-basierte Kompression von Wörtern und Bäumen
Source Type
Doctoral Thesis
Author
Institute
Issue Date
2019
Abstract
The goal of grammar-based compression is to represent a string by a small context-free grammar that produces only this string.
Such a grammar is called a straight-line program (SLP).
Grammar-based compression is a powerful tool to efficiently store data and process the compressed representation without decompressing it.
In the first part of this work, we study the grammar-based compressors LZ78, BiSection, RePair, Greedy and LongestMatch, which are among the most popular compressors in this area.
In the seminal work "The smallest grammar problem" by Charikar et al., the authors derived lower and upper bounds on the approximation ratios of several grammar-based compressors including the algorithms mentioned above.
Unfortunately, the presented lower and upper bounds did not match for any of the compressors.
Here, we close the gaps for LZ78 and BiSection.
For RePair and Greedy we improve the lower bounds.
Moreover, we improve a result of Arpe and Reischuk which relates grammar-based compression for arbitrary alphabets and binary alphabets.
In the second part of this work, we consider grammar-based compression for trees.
The main principle is similar: the goal is to represent a tree by a small linear context-free tree grammar that produces only this tree.
Such a tree grammar is called a tree straight-line program (TSLP).
As a main contribution, we present two algorithms that produce a TSLP of size O(n/log n) for any given tree with n nodes and a constant set of node labels, where we assume that the maximal number of children of a node in the tree is also bounded by a constant.
Additionally, the obtained TSLP has logarithmic depth.
We show that those properties can be achieved in logarithmic space, or alternatively, in linear time.
Similar results on the worst-case size of SLPs are well known.
We use our constructions for two applications.
First, we apply TSLPs to the problem of transforming arithmetical formulas into equivalent circuits of size O((n*log m)/log n) and depth O(log n), where n is the size of the formula and m is the number of different variables occurring in the formula.
As a second application, we present a binary encoding of unlabeled binary trees based on grammar-based tree compression.
We prove that this encoding is worst-case universal and thus asymptotically optimal for certain tree sources.
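To illustrate the notion of a straight-line program used in the abstract, the following is a minimal sketch and is not taken from the thesis. The grammar `RULES`, its nonterminal names, and the helper functions are hypothetical. The size measure shown (total length of all right-hand sides) is one common convention and is assumed here. The sketch shows a grammar that derives exactly one string, and one simple query (the length of that string) answered on the grammar without decompressing it.

```python
from functools import lru_cache

# Hypothetical toy SLP: every nonterminal has exactly one rule and the
# rules are acyclic, so the grammar derives exactly one string.
RULES = {
    "A1": ("a", "b"),    # A1 derives "ab"
    "A2": ("A1", "A1"),  # A2 derives "abab"
    "A3": ("A2", "A2"),  # A3 derives "abababab"
    "S": ("A3", "A1"),   # S  derives "ababababab"
}


def expand(symbol):
    """Decompress: return the unique string derived from `symbol`."""
    if symbol not in RULES:  # terminal symbol
        return symbol
    return "".join(expand(s) for s in RULES[symbol])


@lru_cache(maxsize=None)
def derived_length(symbol):
    """Length of the derived string, computed without building it."""
    if symbol not in RULES:  # terminal symbol
        return len(symbol)
    return sum(derived_length(s) for s in RULES[symbol])


def grammar_size(rules):
    """Assumed size measure: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())


if __name__ == "__main__":
    print(expand("S"), derived_length("S"), grammar_size(RULES))
```

In this toy grammar the derived string has 10 characters while the grammar has size 8. The gap grows with longer repetitive strings, since each doubling rule adds only two symbols to the grammar.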
File(s)
Name
Dissertation_Danny_Hucke.pdf
Size
1.16 MB
Format
Adobe PDF
Checksum
(MD5):1d2aa511f86d5efb2e013aab1022231b
Owning collection