DATA COMPRESSION INTERFACE STANDARD =================================== Author : Ross Williams (ross@spam.ua.oz.au). Date : 20-Jun-1991. CONTENTS -------- 1. Introduction 1.1 Summary of the Standard 1.2 Motivation for the Standard 1.3 Benefits of the Standard 1.4 Scope of the Standard 1.5 Definitions 2. Interface Specification 2.1 Structure of the Interface 2.2 Parameters 2.3 Use of Memory 2.4 Safety Requirement 2.5 Safety Recommendations 2.6 Identity Record 3. Administration 3.1 Introduction 3.2 Revision History 3.3 Child Standards 3.4 Version Control 1. INTRODUCTION --------------- 1.1 Summary of Standard ------------------------ This standard defines a standard interface for the communication between computer programs using data compression algorithms and implementations of data compression algorithms. The standard constrains each data compression algorithm to be implemented as a single procedure and defines the information that must pass through the procedure's parameter list, and the behaviour of the procedure. The standard is programming language independent, seeking not to control the details of the interface but rather the structure of the interface and the nature of the information passed through it. 1.2 Motivation for the Standard ------------------------------- There are now so many data compression algorithms with so many different characteristics, written in so many different languages, on so many different systems that there is pressing need for a standard data compression algorithm interface. Currently it is no small task to port a data compression algorithm that has been tailored for one particular application or environment to another. This portability difficulty may be justified in messier domains such as user interfaces, but there seems to be no need for this in a domain as generic (the manipulation of data) as data compression. 1.3 Benefits of the Standard ---------------------------- Some benefits that will flow from this standard are listed below. * The standard interface will allow programmers to write code that uses data compression without concerning themselves with the details of particular data compression algorithms. The programmer will be able to write a program using the interface and then quickly swap between various algorithms until the one best suited to the application is found. * The standard interface will assist those cataloging and comparing algorithms. * The standard interface will encourage the designers of data compression algorithms to provide conforming implementations. * The standard interface will make it easy for programmers to write programs that select from a range of data compression algorithms at run time. 1.4 Scope of the Standard ------------------------- This standard seeks to constrain the manner in which user programs communicate with data compression algorithms. The standard is defined in terms of information flows and protocols and does not restrict itself to the details of any one particular programming language. However, the standard does distinguish the C programming language as the preferred language for benchmark showcase implementations. Although minor changes in the interfacing will be required when converting a conforming algorithm from one language to another, the standard ensures that this process will be far easier than it would have been if the user programs in each language had picked a different interface paradigm. The standard does not seek to control the format of compressed data. 1.5 Definitions --------------- Algorithm - In this document short for "data compression algorithm" which is an abstract procedural specification for an operation on data. It is important to distinguish between an algorithm and an implementation of an algorithm. Algorithm implementation - A procedure written in a particular programming language that implements a particular abstract data compression algorithm. Byte - In this document, a byte is defined to be eight bits. Memory is measured in bytes. Conforming Implementation - An implementation of a data compression algorithm that conforms to this standard. Conforming Procedure (CP) - A procedure written in a particular programming language that conforms to this standard. Date string - A date string is a standard string of length 11 having the format "dd-mmm-yyyy" where dd is in the range "01".."31", mmm is in the range "Jan","Feb",.."Dec" (Case dependent), and yyyy is in the range "1900" and "9999". MAX_EXPANSION - This is a parameter of the standard which is currently set at 1024 bytes. The parameter specifies the maximum expansion (measured in bytes) per compressed block of data that a data compression algorithm is permitted to make. Printable character - A printable character is defined to be a byte in the range [32,126]. In the ASCII character set, this corresponds to the set of printable characters [' '..'~']. Note: On non-ASCII machines this definition still stands --- the byte values are still to be used. To print such a character on a non-ASCII machine will require some kind of conversion. Procedure - A procedure written in a conventional imperative programming language. RW - Ross Williams (ross@spam.ua.oz.au) Standard string - A standard string is a variable length array of from 0 to 255 bytes having values that are members of the subrange of printable characters. User program - Any computer program employing a data compression algorithm implementation. 2. INTERFACE SPECIFICATION -------------------------- 2.1 Structure of the Interface ------------------------------ 2.1.1 A conforming implementation of a data compression algorithm consists of one or more procedures, one of which is presented to the user program as "the conforming procedure". This standard is not concerned with the implementation of the conforming procedure, only with its parameter list and its behaviour. The remainder of this standard treats the conforming procedure and all the procedures and functions that it might call, as a single procedure. 2.1.2 A conforming procedure communicates only through its parameter list. 2.2 Parameters -------------- 2.2.1 A conforming procedure must have a parameter list that conveys no more and no less information than that conveyed by the following "model" parameter list. IN action - The action requested of the data compression procedure. INOUT memory - A block of memory for use by the algorithm. IN fresh - Is this the start of a new context block? IN inblock - An array of input bytes. OUT outblock - An array of output bytes. OUT identity - Record specifying attributes of the algorithm. The keywords IN, INOUT, and OUT are used in the Ada sense to describe the flow of information at the procedure interface: IN allows the procedure only to read a parameter. OUT allows the procedure only to write the parameter. INOUT allows the procedure to both read and write the parameter. The fact that the procedure has a particular power (i.e. reading and or writing) over a parameter does not require the procedure to exercise that power. 2.2.2 The parameter list in a particular programming language may look quite different, with different names for parameters and perhaps fewer or more parameters (e.g the input block might be passed as two parameters being a pointer to the start of the block and the length of the block) and possibly in a different order. However, the information flowing across the interface should be the same. 2.2.3 ACTION: The action parameter determines the major purpose of the call and takes one of the following three values which are ideally implemented using an enumeration type: Identity Compress Decompress The three values command the following actions: Action "Identity" instructs the conforming procedure to write into the parameter "identity", a record value giving information about the underlying algorithm and its characteristics. The remaining parameters (fresh,model,inblock,outblock) must not be read or written. Action "Compress" instructs the conforming procedure to write a compressed representation of the input block to the output block. Action "Decompress" instructs the conforming procedure to write a decompressed representation of the input block to the output block. 2.2.4 MEMORY: This parameter is a block of memory handed by the user program to the data compression algorithm for use during compression. 2.2.5 FRESH: If and only if this boolean parameter is false, a conforming procedure can assume that the parameter "memory" is the output from a previous call to the conforming procedure in the same mode (i.e. compression or decompression). With this information, the algorithm may choose to use the model present in the parameter "memory" at the end of the previous block of data to operate upon the next block of data. 2.2.6 INBLOCK, OUTBLOCK - These are blocks of memory handed to the compression algorithm for reading and writing. Inblock must only ever be read. Outblock must only ever be written. In this description of the standard, Inblock and Outblock are treated as dynamically-variable-length arrays of bytes. The length of such an array X can be obtained with the length function length(X). Conforming algorithms on a particular implementation must be able to process the largest blocks of input that could possibly be handed to them. If an algorithm can only process (say) 64K, then it should be wrapped in a loop before being encapsulated in a conforming procedure. The standard does not place any constraint on the degree to which a decompression operation can expand the data. It is assumed that if the user compressed the data, the user knows how big it will be when it is decompressed. If this causes trouble, the user can explicitly store the length of the decompressed data along with the compressed data. 2.2.7 IDENTITY - This parameter allows the conforming procedure to return information about its algorithm. Upon return, this parameter has an undefined value unless the parameter "action" had the value "identity" in which case the parameter contains a record value conveying information about the algorithm. In particular, one of the fields of the record contains the amount of memory that the algorithm requires in the "memory" parameter for compression and decompression. The record is discussed in more detail in section 2.6. 2.2.8 This description of these parameters has been highly informal. A more formal definition of the semantics of these parameters can be found in section 2.4.2. The specification in 2.4.2 supercedes much of the informal discussion above. 2.3 Use of Memory ----------------- 2.3.1 A conforming procedure must read and write only the memory in the parameters of its parameter list with the exception of some "extra" memory (which can take the form of registers or local stack variables) which it may use for temporary use, subject to the following conditions: 1) The particular language/environment must ensure that the memory will be available (i.e. failure during allocation is impossible). 2) The total size of the extra memory does not exceed 1024 bytes. 3) The memory is not used to transfer information in or out of the instantiation of the conforming procedure. 2.3.2 A conforming procedure may use the memory contained in the "memory" parameter in any way it chooses. However, if a conforming procedure uses the "fresh" parameter to determine whether there is a model present in the "memory" parameter, then it must also ALWAYS write a model it can make sense of in the "memory" parameter before terminating. See 2.4.2 for a more formal definition. 2.3.3 It should be noted that the presence and semantics of the "memory" parameter allows multiple instances of a compression algorithm to be running at the same time, thus allowing multiple streams of data to be processed in an interleaved fashion. 2.4 Safety Requirement ---------------------- 2.4.1 Perhaps surprisingly, this standard does not require that a conforming data compression algorithm implementation compress data, as this is a mathematical impossibility for all data values. Instead, the standard requires only that the candidate compression procedure provide a reversible transformation and that it not expand the data by more than a fixed constant amount MAX_EXPANSION bytes. 2.4.2 Conforming procedures must satisfy the following safety constraint. BEGIN DEFINITION WHERE n : a constant integer in the range [1,infinty). type bytearray = finite but variable-length array of bytes. type vector = array(1..n) of bytearray. A : variable vector. B : constant set of vectors. W,X,Y,Z : constant vector. ID : constant identity record returned when action=identity. CM,DM : constant non-negative integer. model1 : variable type bytearray of size>=ID.mem_compress. model2 : variable type bytearray of size>=ID.mem_decompress. dummy : polymorphic dummy parameter list placeholder. walrus : a candidate as a conforming procedure. THE STANDARD DEFINES that for a candidate data compression procedure named "walrus" to conform to the standard, there must exist exactly one non-empty set B for each possible value of A such that: {X=A} walrus(compress,true ,model1,X1,Y1,dummy); walrus(compress,false,model1,X2,Y2,dummy); walrus(compress,false,model1,X3,Y3,dummy); ... walrus(compress,false,model1,Xn,Yn,dummy); {Y in B, AND FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION} AND {W in B} walrus(decompress,true ,model2,W1,Z1,dummy); walrus(decompress,false,model2,W2,Z2,dummy); walrus(decompress,false,model2,W3,Z3,dummy); ... walrus(decompress,false,model2,Wn,Zn,dummy); {Z=A} AND ID.deterministic=true => cardinality(B)=1 END DEFINITION 2.4.3 By making B a set, the definition above admits non-deterministic algorithms (such as the LZRW1 algorithm) so long as they advertise the fact by setting ID.deterministic=true. A non-deterministic algorithm is one that non-deterministically chooses one of many representations (possibly of differing lengths) for at least one possible input value. 2.4.4 The clause FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION places a limit to the amount of expansion that a data compression algorithm can perform during a compression operation. This requirement means that algorithms that find that they have expanded data by more than MAX_EXPANSION bytes will have to find some other method of representing the data. As a last resort, an algorithm can always set a copy flag at the start of the compressed block and simply represent the data as itself. 2.5 Safety Recommendations -------------------------- 2.5.1 It is recommended that algorithms encode the following information into their memory parameter for later checking. * The algorithm id. * A randomly chosen context block identification. * The sequence number of the next block in the context. * Whether the block was used for compression or decompression. * A checksum of the block (to detect corruptions by the user). 2.5.2 It is recommended that algorithms encode the following information into compressed blocks for later checking. * The algorithm id. * A randomly chosen context block identification. * A sequence number for each block within a context. * Length and checksum of the each compressed block. * Length and checksum of the each decompressed block. 2.5.3 These measures will go a long way to ensuring that the algorithm is being used according to specification. These recommendations are recommendations only and cannot and should not be enforced as this standard does not place any constraints on data formats. Also, some of these recommendations may be prohibitively inefficient in some applications. 2.6 Identity Record ------------------- 2.6.1 This section defines the information contained in identity records returned in the parameter "identity" by conforming procedures in response to a call with action=identity. The standard does not define the exact details of the record, which will vary from language to language. However, it is expected that the field names specified in the various child standards will be very similar to the ones suggested below. 2.6.2 The identity value returned by a conforming procedure in one particular call must be the same as the value returned in all other calls. 2.6.3 The identity record must contain the information contained in the fields of the following "model" record: record id : natural in 0..(2^31)-1 mem_compress : natural mem_decompress : natural deterministic : boolean name : standard string version : standard string date_creation : date string date_release : date string author : standard string author_address : standard string reference : standard string mechanism : standard string usage : standard string legal : standard string note : standard string rec_min_context : natural rec_min_block : natural end record 2.6.5 This section contains a brief description of each field. 2.6.4.1 id: The "id" field is a 31 bit value being a unique identification number of the underlying algorithm. Designers of algorithms must choose an id for their algorithm by tossing a fair coin 31 times, once for each bit of the id. The tossing of a coin is a requirement of this standard. Under no circumstances is a computer's random number generator to be used to generate an id. In the unlikely event that two algorithm designers have chosen the same id number, one of the algorithm's numbers will be changed manually. There may be many conforming procedures implemented in the same or different languages that have the same id. This is allowed so long as they: 1) Can all decompress each other's data. 2) All use the same amount of memory. 3) Are either all deterministic or non-deterministic. 2.6.4.2 mem_compress: The number of bytes of memory required by the algorithm during a compression operation. 2.6.4.3 mem_decompress: The number of bytes of memory required by the algorithm during a decompression operation. 2.6.4.4 deterministic: A boolean flag which is TRUE iff the algorithm is guaranteed to generate at most one compressed representation for each distinct input block. 2.6.4.5 name: The name of the algorithm (e.g. "LZRW1"). Note that the identity of the algorithm rests in its id number, not its name. 2.6.4.6 version: Version information. 2.6.4.7 date_creation: The date on which this particular procedure was first released. 2.6.4.8 date_update: The date of the most recent update of this procedure. 2.6.4.9 author: The name of the author or publisher of the algorithm. 2.6.4.10 author_address: How to contact the author. 2.6.4.11 reference: A reference to a publication describing the algorithm. 2.6.4.12 mechanism: A brief description of the way the algorithm works. 2.6.4.13 usage: Hints on how the algorithm is best used. 2.6.4.14 legal: A message about the legal status of the code and the algorithm. 2.6.4.15 note: Miscellaneous information about the algorithm. 2.6.4.16 rec_min_context: RECOMMENDED minimum length for context blocks. Use of blocks of less than this length will seriously impair compression. Note: The user is free to ignore this recommendation. 2.6.4.17 rec_min_block: RECOMMENDED number of bytes of overhead caused by closing off a block (i.e. flushing the coder). Note: The user must not RELY on this information in any way. 3. ADMINISTRATION ----------------- 3.1 Introduction ---------------- This section contains the administrative details of the standard. It outlines what is expected of those constructing conforming data compression algorithm implementations and records the details of the standard's history. 3.2 Revision History -------------------- This section gives the revision history of this standards document. Date Name Action ---- ---- ------ 19-Jun-1991 RW Created the first version of the standard. 3.3 Child Standards ------------------- Although this standard specifies a generic language-independent interface for data compression algorithm implementations, it is hoped that child standards (Note: "substandard"s sounds bad!) will be created for each programming language in which conforming data compression algorithm implementations are written. Each child standard should specify a procedure interface tightly enough to allow total interchangeability of conforming procedures. Although algorithm designers are free to implement conforming procedures in whatever language they choose, it is hoped that they will additionally create a version in C conforming to the C child-standard. While this may seem inconvenient at the time, in the long term, it will result in a large pool of interchangeable algorithms whose performance characteristics on particular kinds of data on particular kinds of machines can quickly be compared. 3.4 Version Control ------------------- Once an algorithm designer has designed an algorithm, chosen an id number and released the algorithm for others to use, there is only a restricted set of changes that can be made to the algorithm without having to define a new algorithm with a new id. If an algorithm is modified, the revised algorithm should have either a different id or a different version number. A new algorithm with a new id must be created if one or more of the following hold: 1 The old algorithm is unable to decompress any compressed data generated by the new algorithm. 2 The new algorithm is unable to decompress any compressed data generated by the old algorithm. 3 The algorithm has changed from deterministic to non-deterministic. (Note: It is possible to do this without violating 1 or 2). 4 The new algorithm requires more memory than the old algorithm. These rules are designed to protect systems that have come to rely on particular properties of an algorithm. Once an algorithm has been created and been allocated an id, this standard does not support its destruction. ----