Grammar-based compression for strings and trees

Hucke, Danny

Citation link: http://dx.doi.org/10.25819/ubsi/610

Files in This Item:

File	Description	Size	Format
Dissertation_Danny_Hucke.pdf		1.18 MB	Adobe PDF	View/Open

Dokument Type:	Doctoral Thesis
metadata.dc.title:	Grammar-based compression for strings and trees
Other Titles:	Grammatik-basierte Kompression von Wörtern und Bäumen
Authors:	Hucke, Danny
Institute:	Institut für Theoretische Informatik
Free keywords:	Kompression, Bäume, Grammar-based compression, strings
Dewey Decimal Classification:	004 Informatik
GHBS-Clases:	TVBC TUH
Issue Date:	2019
Publish Date:	2019
Abstract:	Das Ziel grammatik-basierter Datenkompression ist es ein Wort (oder einen Text) durch eine kontextfreie Grammatik darzustellen, die ausschließlich dieses Wort erzeugt. Eine kontextfreie Grammatik mit dieser Eigenschaft wird als Straight-line Program (SLP) bezeichnet. Grammatik-basierte Kompression ist ein nützliches Werkzeug um Daten zu komprimieren und verschiedene Anfragen effizient mit Hilfe der komprimierten Repräsentation zu beantworten ohne diese vorher zu dekomprimieren. Im ersten Teil dieser Arbeit werden die grammatikbasierten Kompressoren LZ78, BiSection, RePair, Greedy und LongestMatch untersucht, die zu den bekanntesten Algorithmen in diesem Bereich zählen. In der bedeutenden Veröffentlichung "The smallest grammar problem" von Charikar et al. wurden untere und obere Schranken für die Approximationsgüte verschiedenster grammatik-basierter Kompressoren gezeigt, darunter die oben genannten. Leider besteht für jeden Algorithmus eine Lücke zwischen der unteren und oberen Schranke. In dieser Arbeit werden diese Lücken für die Kompressoren LZ78 und BiSection geschlossen. Des Weiteren werd für RePair und für Greedy die unteren Schranken verbessert. Ein weiteres Ergebnis dieser Arbeit verbessert ein Resultat von Arpe und Reischuk, welches grammatik-basierte Kompression über beliebigen Alphabeten und binären Alphabeten verbindet. Im zweiten Teil dieser Arbeit betrachten wir grammatik-basierte Kompression für Bäume. Das Prinzip ist ähnlich, da in diesem Kontext ein Baum durch eine lineare, kontextfreie Baumgrammatik erzeugt wird. Eine solche Grammatik wird Tree Straight-line Program (TSLP) genannt. Als Hauptresultat in diesem Zusammenhang werden zwei Algorithmen präsentiert, die für einen Baum mit n Knoten und konstant vielen verschiedenen Beschriftungen der Knoten ein TSLP der Gr öße O(n/log n) generieren. Zusätzlich wird erreicht, dass das TSLP logarithmische Tiefe hat. Diese Eigenschaften können in logarithmischen Platz oder alternativ in linearer Zeit erreicht werden. Entsprechende Resultate für grammatik-basierte Kompression von Wörtern sind wohl bekannt. Es werden zusätzlich zwei Anwendungen präsentiert: Zuerst werden TSLPs für die Umwandlung von arithmetischen Formeln zu arithmetischen Schaltkreisen der Größe O((nlog m)/log n) und Tiefe O(log n) genutzt, wobei n die Größe der Formel und m die Anzahl verschiedener Variablen in der Formel ist. Als zweite Anwendung wird eine binäre Kodierung von unbeschrifteten Binärbäumen auf der Basis von grammatik-basierter Baumkompression präsentiert. Es wird gezeigt, dass diese Kodierung unter gewissen Voraussetzungen asymptotisch optimal ist. The goal of grammar-based compression is to represent a string by a small context-free grammar that produces only this string. Such a grammar is called a straight-line program (SLP). Grammar-based compression is a powerful tool to efficiently store data and process the compressed representation without decompressing it. In the first part of this work, we study the grammar-based compressors LZ78, BiSection, Repair, Greedy and LongestMatch, which are among the most popular compressors in this area. In the seminal work "The smallest grammar problem" by Charikar et al., the authors derived lower and upper bounds on the approximation ratios of several grammar-based compressors including the algorithms mentioned above. Unfortunately, for none of the compressors the presented bounds matched. Here, we close the gaps for LZ78 and BiSection. For RePair and Greedy we improve the lower bounds. Moreover, we improve a result of Arpe and Reischuk which relates grammar-based compression for arbitrary alphabets and binary alphabets. In the second part of this work, we consider grammar-based compression for trees. The main principle is similar, because the goal is to represent a tree by a small linear context-free tree grammar that produces only this tree. Such a tree grammar is called a tree straight-line program (TSLP). As a main contribution, we present two algorithms that produce a TSLP of size O(n/log n) for any given tree with n nodes and a constant set of node labels, where we assume that the maximal number of children of a node in the tree is also bounded by a constant. Additionally, the obtained TSLP has logarithmic depth. We show that those properties can be achieved in logarithmic space, or alternatively, in linear time. Similar results on the worst-case size of SLPs are well known. We use our constructions for two applications: First, we apply TSLPs to the problem of transforming arithmetical formulas into equivalent circuits of size O((nlog m)/log n) and depth O(log n), where n is the size of the formula and m is number of different variables occurring in the formula. As a second application, we present a binary encoding of unlabeled binary trees based on grammar-based tree compression. We prove that this encoding is worst-case universal and thus asymptotically optimal for certain tree sources.
DOI:	http://dx.doi.org/10.25819/ubsi/610
URN:	urn:nbn:de:hbz:467-15361
URI:	https://dspace.ub.uni-siegen.de/handle/ubsi/1536
License:	http://creativecommons.org/publicdomain/zero/1.0/
Appears in Collections:	Hochschulschriften

This item is protected by original copyright

View License

Show full item record

Page view(s)

655

checked on Apr 3, 2025

Download(s)

304

checked on Apr 3, 2025

Google Scholar^TM

Check

Altmetric

This item is licensed under a Creative Commons License

Opus Siegen

Files in This Item:

Page view(s)

Download(s)

Google ScholarTM

Altmetric

Google Scholar^TM