DATA COMPRESSION INTERFACE STANDARD
===================================
Author : Ross Williams (ross@spam.ua.oz.au).
Date   : 20-Jun-1991.

CONTENTS
--------
1. Introduction
   1.1 Summary of the Standard
   1.2 Motivation for the Standard
   1.3 Benefits of the Standard
   1.4 Scope of the Standard
   1.5 Definitions

2. Interface Specification
   2.1 Structure of the Interface
   2.2 Parameters
   2.3 Use of Memory
   2.4 Safety Requirement
   2.5 Safety Recommendations
   2.6 Identity Record

3. Administration
   3.1 Introduction
   3.2 Revision History
   3.3 Child Standards
   3.4 Version Control


1. INTRODUCTION
---------------

1.1 Summary of Standard
------------------------
This  standard  defines a  standard  interface  for the  communication
between  computer  programs  using  data  compression  algorithms  and
implementations   of  data   compression   algorithms.  The   standard
constrains  each data  compression  algorithm to  be implemented  as a
single procedure  and defines the  information that must  pass through
the procedure's  parameter list, and  the behaviour of  the procedure.
The  standard  is programming  language  independent,  seeking not  to
control the details  of the interface but rather the  structure of the
interface and the nature of the information passed through it.

1.2 Motivation for the Standard
-------------------------------
There  are  now so  many  data  compression  algorithms with  so  many
different characteristics, written in  so many different languages, on
so many different  systems that there is pressing need  for a standard
data compression algorithm interface. Currently it is no small task to
port  a data  compression algorithm  that  has been  tailored for  one
particular  application or  environment to  another. This  portability
difficulty  may  be   justified  in  messier  domains   such  as  user
interfaces, but  there seems  to be no  need for this  in a  domain as
generic (the manipulation of data) as data compression.

1.3 Benefits of the Standard
----------------------------
Some benefits that will flow from this standard are listed below.

  * The standard interface  will allow programmers to  write code that
    uses  data  compression  without concerning  themselves  with  the
    details of particular data  compression algorithms. The programmer
    will  be able  to write  a program  using the  interface and  then
    quickly swap between various algorithms  until the one best suited
    to the application is found.

  * The standard interface will  assist those cataloging and comparing
    algorithms.

  * The  standard  interface  will  encourage the  designers  of  data
    compression algorithms to provide conforming implementations.

  * The standard interface will make  it easy for programmers to write
    programs that select  from a range of  data compression algorithms
    at run time.

1.4 Scope of the Standard
-------------------------
This standard  seeks to  constrain the manner  in which  user programs
communicate with data compression  algorithms. The standard is defined
in  terms of  information flows  and protocols  and does  not restrict
itself  to the  details of  any one  particular programming  language.
However, the standard  does distinguish the C  programming language as
the preferred language for benchmark showcase implementations.

Although  minor  changes in  the  interfacing  will be  required  when
converting a  conforming algorithm from  one language to  another, the
standard ensures  that this process will  be far easier than  it would
have been if the user programs in each language had picked a different
interface paradigm.

The standard does not seek to control the format of compressed data.

1.5 Definitions
---------------

Algorithm -  In this document  short for "data  compression algorithm"
    which is an abstract procedural  specification for an operation on
    data. It is  important to distinguish between an  algorithm and an
    implementation of an algorithm.

Algorithm  implementation  -  A  procedure  written  in  a  particular
    programming language  that implements  a particular  abstract data
    compression algorithm.

Byte - In this document, a byte is defined to be eight bits. Memory is
    measured in bytes.

Conforming Implementation  - An  implementation of a  data compression
    algorithm that conforms to this standard.

Conforming  Procedure  (CP) -  A  procedure  written in  a  particular
    programming language that conforms to this standard.

Date string - A  date string is a standard string  of length 11 having
    the format "dd-mmm-yyyy" where dd  is in the range "01".."31", mmm
    is in the range "Jan","Feb",.."Dec"  (Case dependent), and yyyy is
    in the range "1900" and "9999".

MAX_EXPANSION - This is a parameter of the standard which is currently
    set at 1024  bytes. The parameter specifies  the maximum expansion
    (measured  in bytes)  per compressed  block  of data  that a  data
    compression algorithm is permitted to make.

Printable character - A printable character is defined to be a byte in
    the range [32,126].  In the ASCII character  set, this corresponds
    to the set of printable  characters [' '..'~']. Note: On non-ASCII
    machines  this definition  still stands  --- the  byte values  are
    still to be used. To print such a character on a non-ASCII machine
    will require some kind of conversion.

Procedure  -   A  procedure  written  in   a  conventional  imperative
    programming language.

RW - Ross Williams (ross@spam.ua.oz.au)

Standard string - A standard string is a variable length array of from
    0 to 255  bytes having values that are members  of the subrange of
    printable characters.

User program  -  Any computer  program  employing  a data  compression
    algorithm implementation.

2. INTERFACE SPECIFICATION
--------------------------

2.1 Structure of the Interface
------------------------------
2.1.1  A conforming  implementation  of a  data compression  algorithm
consists of one  or more procedures, one of which  is presented to the
user  program as  "the  conforming procedure".  This  standard is  not
concerned with  the implementation  of the conforming  procedure, only
with  its parameter  list and  its  behaviour. The  remainder of  this
standard treats  the conforming procedure  and all the  procedures and
functions that it might call, as a single procedure.

2.1.2 A  conforming procedure communicates only  through its parameter
list.

2.2 Parameters
--------------
2.2.1 A conforming  procedure must have a parameter  list that conveys
no more  and no less information  than that conveyed by  the following
"model" parameter list.

   IN    action    - The action requested of the data compression procedure.
   INOUT memory    - A block of memory for use by the algorithm.
   IN    fresh     - Is this the start of a new context block?
   IN    inblock   - An array of input  bytes.
   OUT   outblock  - An array of output bytes.
   OUT   identity  - Record specifying attributes of the algorithm.

The keywords IN, INOUT, and OUT are  used in the Ada sense to describe
the flow of information at the procedure interface:

   IN    allows the procedure only to read a parameter.
   OUT   allows the procedure only to write the parameter.
   INOUT allows the procedure to both read and write the parameter.

The fact that  the procedure has a particular power  (i.e. reading and
or  writing)  over a  parameter  does  not  require the  procedure  to
exercise that power.

2.2.2 The parameter list in a particular programming language may look
quite different, with different names for parameters and perhaps fewer
or  more parameters  (e.g  the  input block  might  be  passed as  two
parameters being a pointer to the start of the block and the length of
the block) and possibly in a different order. However, the information
flowing across the interface should be the same.

2.2.3 ACTION: The action parameter determines the major purpose of the
call and  takes one of  the following  three values which  are ideally
implemented using an enumeration type:

   Identity
   Compress
   Decompress

The three values command the following actions:

Action "Identity" instructs the conforming procedure to write into the
parameter  "identity", a  record  value giving  information about  the
underlying algorithm and its characteristics. The remaining parameters
(fresh,model,inblock,outblock) must not be read or written.

Action  "Compress"  instructs  the  conforming  procedure  to  write a
compressed representation of the input block to the output block.

Action  "Decompress" instructs  the  conforming  procedure to  write a
decompressed representation of the input block to the output block.

2.2.4 MEMORY: This  parameter is a block of memory  handed by the user
program to the data compression algorithm for use during compression.

2.2.5  FRESH:  If and  only  if  this  boolean parameter  is  false, a
conforming procedure  can assume  that the  parameter "memory"  is the
output from  a previous call to  the conforming procedure in  the same
mode (i.e.  compression or decompression). With  this information, the
algorithm  may  choose to  use  the  model  present in  the  parameter
"memory" at the end of the previous  block of data to operate upon the
next block of data.

2.2.6 INBLOCK,  OUTBLOCK - These  are blocks  of memory handed  to the
compression algorithm for reading and  writing. Inblock must only ever
be read.  Outblock must only ever  be written. In this  description of
the    standard,    Inblock    and    Outblock    are    treated    as
dynamically-variable-length  arrays of  bytes. The  length of  such an
array X can be obtained with the length function length(X).

Conforming algorithms on  a particular implementation must  be able to
process the largest  blocks of input that could possibly  be handed to
them. If  an algorithm can only  process (say) 64K, then  it should be
wrapped in a loop before being encapsulated in a conforming procedure.

The standard  does not place any  constraint on the degree  to which a
decompression operation can expand the data. It is assumed that if the
user compressed the data, the user knows how big it will be when it is
decompressed. If  this causes trouble,  the user can  explicitly store
the length of the decompressed data along with the compressed data.

2.2.7 IDENTITY  - This  parameter allows  the conforming  procedure to
return information  about its  algorithm. Upon return,  this parameter
has an  undefined value  unless the parameter  "action" had  the value
"identity"  in  which  case  the parameter  contains  a  record  value
conveying information about  the algorithm. In particular,  one of the
fields of the record contains the  amount of memory that the algorithm
requires in the "memory"  parameter for compression and decompression.
The record is discussed in more detail in section 2.6.

2.2.8 This description of these parameters has been highly informal. A
more formal  definition of  the semantics of  these parameters  can be
found in section 2.4.2. The  specification in 2.4.2 supercedes much of
the informal discussion above.

2.3 Use of Memory
-----------------
2.3.1 A  conforming procedure must read  and write only the  memory in
the  parameters of  its  parameter list  with  the  exception of  some
"extra" memory  (which can take the  form of registers or  local stack
variables)  which  it  may  use for  temporary  use,  subject  to  the
following conditions:

   1) The particular language/environment must ensure that the memory
      will be available (i.e. failure during allocation is impossible).
   2) The total size of the extra memory does not exceed 1024 bytes.
   3) The memory is not used to transfer information in or out of the
      instantiation of the conforming procedure.

2.3.2  A conforming  procedure may  use  the memory  contained in  the
"memory" parameter  in any  way it chooses.  However, if  a conforming
procedure uses the  "fresh" parameter to determine  whether there is a
model  present in  the "memory"  parameter, then  it must  also ALWAYS
write a  model it can make  sense of in the  "memory" parameter before
terminating. See 2.4.2 for a more formal definition.

2.3.3  It should  be  noted that  the presence  and  semantics of  the
"memory"  parameter   allows  multiple  instances  of   a  compression
algorithm  to be  running at  the  same time,  thus allowing  multiple
streams of data to be processed in an interleaved fashion.

2.4 Safety Requirement
----------------------
2.4.1  Perhaps surprisingly,  this  standard does  not require  that a
conforming data compression algorithm implementation compress data, as
this is a mathematical impossibility for all data values. Instead, the
standard  requires  only  that  the  candidate  compression  procedure
provide a reversible transformation and that it not expand the data by
more than a fixed constant amount MAX_EXPANSION bytes.

2.4.2 Conforming procedures must satisfy the following safety constraint.

BEGIN DEFINITION
WHERE
   n        : a constant integer in the range [1,infinty).
   type bytearray = finite but variable-length array of bytes.
   type vector    = array(1..n) of bytearray.
   A        : variable   vector.
   B        : constant   set of vectors.
   W,X,Y,Z  : constant   vector.
   ID       : constant   identity record returned when action=identity.
   CM,DM    : constant   non-negative integer.
   model1   : variable   type bytearray of size>=ID.mem_compress.
   model2   : variable   type bytearray of size>=ID.mem_decompress.
   dummy    :            polymorphic dummy parameter list placeholder.
   walrus   :            a candidate as a conforming procedure.

THE STANDARD DEFINES  that for a candidate  data compression procedure
named "walrus"  to conform to  the standard, there must  exist exactly
one non-empty set B for each possible value of A such that:

   {X=A}
   walrus(compress,true ,model1,X1,Y1,dummy);
   walrus(compress,false,model1,X2,Y2,dummy);
   walrus(compress,false,model1,X3,Y3,dummy);
   ...
   walrus(compress,false,model1,Xn,Yn,dummy);
   {Y in B, AND FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION}

AND

   {W in B}
   walrus(decompress,true ,model2,W1,Z1,dummy);
   walrus(decompress,false,model2,W2,Z2,dummy);
   walrus(decompress,false,model2,W3,Z3,dummy);
   ...
   walrus(decompress,false,model2,Wn,Zn,dummy);
   {Z=A}

AND

   ID.deterministic=true => cardinality(B)=1

END DEFINITION

2.4.3 By making B a set, the definition above admits non-deterministic
algorithms (such as the LZRW1 algorithm) so long as they advertise the
fact by  setting ID.deterministic=true. A  non-deterministic algorithm
is one that non-deterministically  chooses one of many representations
(possibly of differing lengths) for at least one possible input value.

2.4.4 The clause

   FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION

places a  limit to  the amount  of expansion  that a  data compression
algorithm can perform during a compression operation. This requirement
means that algorithms  that find that they have expanded  data by more
than  MAX_EXPANSION bytes  will  have to  find  some  other method  of
representing the data. As a last resort, an algorithm can always set a
copy flag  at the start of  the compressed block and  simply represent
the data as itself.

2.5 Safety Recommendations
--------------------------
2.5.1  It   is  recommended  that  algorithms   encode  the  following
information into their memory parameter for later checking.

   * The algorithm id.
   * A randomly chosen context block identification.
   * The sequence number of the next block in the context.
   * Whether the block was used for compression or decompression.
   * A checksum of the block (to detect corruptions by the user).

2.5.2  It   is  recommended  that  algorithms   encode  the  following
information into compressed blocks for later checking.

   * The algorithm id.
   * A randomly chosen context block identification.
   * A sequence number for each block within a context.
   * Length and checksum of the each compressed block.
   * Length and checksum of the each decompressed block.

2.5.3 These measures will go a long way to ensuring that the algorithm
is being  used according  to specification. These  recommendations are
recommendations only  and cannot  and should not  be enforced  as this
standard does not place any constraints on data formats. Also, some of
these  recommendations  may  be   prohibitively  inefficient  in  some
applications.

2.6 Identity Record
-------------------
2.6.1  This  section defines  the  information  contained in  identity
records returned in the  parameter "identity" by conforming procedures
in  response to  a call  with action=identity.  The standard  does not
define the exact details of the  record, which will vary from language
to language. However, it is expected that the field names specified in
the various child standards will be very similar to the ones suggested
below.

2.6.2 The  identity value  returned by a  conforming procedure  in one
particular call  must be the same  as the value returned  in all other
calls.

2.6.3 The  identity record must  contain the information  contained in
the fields of the following "model" record:

   record
     id              : natural in 0..(2^31)-1
     mem_compress    : natural
     mem_decompress  : natural
     deterministic   : boolean
     name            : standard string
     version         : standard string
     date_creation   : date string
     date_release    : date string
     author          : standard string
     author_address  : standard string
     reference       : standard string
     mechanism       : standard string
     usage           : standard string
     legal           : standard string
     note            : standard string
     rec_min_context : natural
     rec_min_block   : natural
   end record

2.6.5 This section contains a brief description of each field.

2.6.4.1  id:  The  "id"  field  is  a 31  bit  value  being  a  unique
identification  number  of  the  underlying  algorithm.  Designers  of
algorithms must  choose an id  for their  algorithm by tossing  a fair
coin 31 times, once for each bit of the id. The tossing of a coin is a
requirement of this  standard. Under no circumstances  is a computer's
random number generator to be used  to generate an id. In the unlikely
event that two algorithm designers have chosen the same id number, one
of the algorithm's numbers will be changed manually. There may be many
conforming procedures  implemented in the same  or different languages
that have the same id. This is allowed so long as they:

        1) Can all decompress each other's data.
        2) All use the same amount of memory.
        3) Are either all deterministic or non-deterministic.

2.6.4.2 mem_compress:  The number of  bytes of memory required  by the
algorithm during a compression operation.

2.6.4.3 mem_decompress: The number of  bytes of memory required by the
algorithm during a decompression operation.

2.6.4.4 deterministic: A boolean flag  which is TRUE iff the algorithm
is guaranteed  to generate at  most one compressed  representation for
each distinct input block.

2.6.4.5 name: The name of the  algorithm (e.g. "LZRW1"). Note that the
identity of the algorithm rests in its id number, not its name.

2.6.4.6 version: Version information.

2.6.4.7 date_creation: The date on which this particular procedure was
first released.

2.6.4.8  date_update: The  date  of the  most  recent  update of  this
procedure.

2.6.4.9 author: The name of the author or publisher of the algorithm.

2.6.4.10 author_address: How to contact the author.

2.6.4.11  reference:  A  reference  to a  publication  describing  the
algorithm.

2.6.4.12  mechanism: A  brief  description of  the  way the  algorithm
works.

2.6.4.13 usage: Hints on how the algorithm is best used.

2.6.4.14 legal: A  message about the legal status of  the code and the
algorithm.

2.6.4.15 note: Miscellaneous information about the algorithm.

2.6.4.16  rec_min_context:  RECOMMENDED  minimum  length  for  context
blocks. Use of  blocks of less than this length  will seriously impair
compression. Note: The user is free to ignore this recommendation.

2.6.4.17 rec_min_block: RECOMMENDED number of bytes of overhead caused
by closing off a block (i.e.  flushing the coder). Note: The user must
not RELY on this information in any way.

3. ADMINISTRATION
-----------------

3.1 Introduction
----------------
This section contains  the administrative details of  the standard. It
outlines  what  is  expected  of those  constructing  conforming  data
compression algorithm  implementations and records the  details of the
standard's history.

3.2 Revision History
--------------------
This section gives the revision history of this standards document.

Date         Name  Action
----         ----  ------
19-Jun-1991  RW    Created the first version of the standard.

3.3 Child Standards
-------------------
Although  this  standard  specifies   a  generic  language-independent
interface for data compression  algorithm implementations, it is hoped
that  child  standards  (Note:  "substandard"s sounds  bad!)  will  be
created  for  each  programming  language  in  which  conforming  data
compression algorithm implementations are written. Each child standard
should specify  a procedure  interface tightly  enough to  allow total
interchangeability of conforming procedures.

Although  algorithm   designers  are  free  to   implement  conforming
procedures in  whatever language  they choose, it  is hoped  that they
will  additionally  create   a  version  in  C   conforming  to  the C
child-standard. While this  may seem inconvenient at the  time, in the
long  term,  it  will  result  in  a  large  pool  of  interchangeable
algorithms whose  performance characteristics  on particular  kinds of
data on particular kinds of machines can quickly be compared.

3.4 Version Control
-------------------
Once an  algorithm designer  has designed an  algorithm, chosen  an id
number and released  the algorithm for others to use,  there is only a
restricted set  of changes that can  be made to the  algorithm without
having to define a new algorithm with a new id.

If an algorithm is modified,  the revised algorithm should have either
a different id or a different version number.

A new algorithm  with a new id must  be created if one or  more of the
following hold:

   1 The  old algorithm  is unable to  decompress any  compressed data
     generated by the new algorithm.

   2 The  new algorithm  is unable to  decompress any  compressed data
     generated by the old algorithm.

   3 The algorithm has changed from deterministic to
     non-deterministic. (Note: It is possible  to do this without
     violating 1 or 2).

   4 The new algorithm requires more memory than the old algorithm.

These rules are designed to protect  systems that have come to rely on
particular  properties of  an algorithm.  Once an  algorithm has  been
created and been  allocated an id, this standard does  not support its
destruction.

--<End of Proposed Data Compression Interface Standard>--