yEncode - A quick and dirty encoding for binaries
---------------------------------------------------------------------------
Version 1.3 - 05-Mar-2002 - by Juergen Helbing


Revisions:
v1.0, 31-Jul-2001 - Juergen Helbing (juergen@helbing.de)
v1.1, 17-Feb-2002 - Steve Blinch (yenc32@esitemedia.com)
v1.2, 28-Feb-2002 - Juergen Helbing (juergen@helbing.de)
v1.3, 05-Mar-2002 - Juergen Helbing (juergen@helbing.de)

Introduction
---------------------------------------------------------------------------
This document describes a mechanism for encoding arbitrary binary
information for transmission by electronic mail and newsgroups.  Unlike
similar encoding schemes, yEncode takes advantage of the entire 8-bit
character set, rendering output only 1-2% larger than the original binary
source.  


Motivation
---------------------------------------------------------------------------
Existing mechanisms for transmission of binary information by electronic
mail and newsgroups make use of only 7-bit ASCII text.  The resulting
encoded data are up to 40% larger than the original binary information.

yEncode intends to reduce the additional overhead of existing encoding
schemes by taking advantage of the full 8-bit character set, which has
become widely used and acceptable in Internet newsgroups.  Special
consideration is provided for specific reserved ASCII control characters to
avoid interference with existing message transfer protocols.

The overhead of yEncoded binary data can be as little as 1-2%.  


Encoding Principle
---------------------------------------------------------------------------
The encoding process represents each octet of input data with a single
corresponding encoded output character.  The ASCII value of each output
character is derived by the following simple formula:

O = (I+42) % 256

That is, the output value is equal to the ASCII value of each input
character plus 42, all modulo 256.  The offset reduces overhead by
reducing the number of NULL characters (ASCII 00h) that would otherwise
need to be escaped, since many binaries contain a disproportionately large
number of NULLs.
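
As a minimal sketch of this formula in Python (the names OFFSET,
shift_encode and shift_decode are illustrative only and are not defined
by yEncode):

```python
OFFSET = 42  # constant from the formula O = (I+42) % 256


def shift_encode(value: int) -> int:
    """Map one input octet (0-255) to its output value, before escaping."""
    return (value + OFFSET) % 256


def shift_decode(value: int) -> int:
    """Inverse mapping: recover the input octet from an output value."""
    return (value - OFFSET) % 256
```

With this mapping a NULL input octet becomes 42 ("*"), which needs no
escape character.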

Under special circumstances, a single escape character (ASCII 3Dh, "=") is
used to indicate that the following output character is "critical", and
requires special handling.

Critical characters include the following:

ASCII 00h (NULL)
ASCII 0Ah (LF)
ASCII 0Dh (CR)
ASCII 3Dh (=)

> ASCII 09h (TAB)  -- removed in version (1.2)

These characters should always be escaped.  Additionally, the technique
used to encode critical characters (described in the next section) allows
any character to be escaped; yDecoder implementations should be capable of
decoding any character following an escape character.
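
A decoder along these lines might be sketched as follows.  This is only
an illustration in Python: the name ydecode_data is made up for this
example, the input is assumed to contain data lines only (no keyword
lines), and the offset of 64 is the one given in the Encoding Technique
section.

```python
ESCAPE = 0x3D  # "="


def ydecode_data(encoded: bytes) -> bytes:
    """Decode yEncoded data lines (no =ybegin/=yend keyword lines)."""
    out = bytearray()
    escaped = False
    for value in encoded:
        if escaped:
            # Any character may follow the escape character, not only the
            # four critical ones, so the offset of 64 is always undone.
            out.append((value - 64 - 42) % 256)
            escaped = False
        elif value == ESCAPE:
            escaped = True
        elif value in (0x0D, 0x0A):
            continue  # line breaks are transport framing, not data
        else:
            out.append((value - 42) % 256)
    return bytes(out)
```

Because the escaped branch does not check which character follows "=",
the sketch also accepts escaped TABs, SPACEs, DOTs, or anything else an
encoder chooses to escape.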

The probability of occurrence of each of these 4 characters in evenly
distributed binary input data is approximately 0.4% (1 in 256).  On
average, escape sequences therefore cause approximately 1.6% overhead
when only these 4 characters are escaped.
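
The figures above can be checked with a short calculation.  The sketch
assumes that every octet value is equally likely in the input, which is
an idealization; real binaries will vary.

```python
CRITICAL = (0x00, 0x0A, 0x0D, 0x3D)  # NULL, LF, CR, "="

# Input octets whose shifted value (I+42) % 256 is a critical character.
escaped_inputs = [i for i in range(256) if (i + 42) % 256 in CRITICAL]

# With evenly distributed input, each of them costs one extra output octet.
escape_overhead = len(escaped_inputs) / 256

print([hex(i) for i in escaped_inputs])
print(escape_overhead)
```

Four of the 256 possible input values need an escape character, so the
expected escape overhead is 4/256, or roughly 1.6%.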

The carriage return/linefeed overhead for every line depends on the
developer-defined line length.  Header and trailer lines are relatively
small, and cause negligible impact on output size.  

>(1.2) Careful writers of encoders will encode TAB (09h) and SPACE (20h)
>if they would appear in the first or last column of a line.
>Implementors who write directly to a TCP stream will take care of the
>doubling of dots in the first column - or also encode a DOT in the
>first column.
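
One way to express these optional rules is a small predicate that an
encoder consults for every shifted output value.  The function name and
its two boolean parameters are assumptions made for this sketch; how the
caller knows that a value will land in the last column is left out.

```python
def needs_escape(value: int, first_column: bool, last_column: bool) -> bool:
    """Decide whether an already shifted output value should be escaped."""
    if value in (0x00, 0x0A, 0x0D, 0x3D):
        return True  # the four characters that are always critical
    if value in (0x09, 0x20) and (first_column or last_column):
        return True  # TAB or SPACE at either edge of a line
    if value == 0x2E and first_column:
        return True  # "." that would otherwise start a line
    return False
```

Escaping the leading DOT is an alternative to doubling it; an encoder
that hands its output to a library which already doubles dots would not
need the last rule.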


Encoding Technique
---------------------------------------------------------------------------
A typical encoding process might look something like this:


 1. Fetch a character from the input stream.  
 2. Increment the character's ASCII value by 42, modulo 256 
 3. If the result is a critical character (as defined in the previous
    section), write the escape character to the output stream and increment
    the character's ASCII value by 64, modulo 256.  
 4. Output the character to the output stream.  
 5. Repeat from start.  
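
The steps above can be sketched in Python as follows.  The function name
yencode_data is hypothetical, and line breaks are deliberately left out
of this example.

```python
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}  # NULL, LF, CR, "="
ESCAPE = 0x3D  # "="


def yencode_data(data: bytes) -> bytes:
    """Encode octets following steps 1-5, without inserting line breaks."""
    out = bytearray()
    for value in data:                  # step 1: fetch a character
        value = (value + 42) % 256      # step 2: add 42, modulo 256
        if value in CRITICAL:           # step 3: escape critical characters
            out.append(ESCAPE)
            value = (value + 64) % 256
        out.append(value)               # step 4: output the character
    return bytes(out)                   # step 5 is the loop itself
```

For example, the input octet D6h shifts to 00h, which is critical, so it
is written as "=" followed by 40h ("@").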


To facilitate transmission via existing standard protocols (most notably
NNTP), carriage return/linefeed pairs should be written to the output
stream after every n characters, where n is the desired line length.  
Typical values for n are 128 and 256.
>(1.2) See additional experience information

If a critical character appears in the nth position of a line, both the
escape character and the encoded critical character must be written to the
same line, before the carriage return/linefeed.  In this event, the actual
number of characters in the line is equal to n+1.  Effectively, this means
that a line cannot end with an escape character, and that a line with n+1
characters must end with an encoded critical character.  


Headers and Trailers
---------------------------------------------------------------------------
Similar to other binary encoding mechanisms, yEncode makes use of special
keyword lines to mark the beginning and end of encoded data blocks.  These
blocks may be embedded in any standard 8-bit ASCII text file.  yDecoder
implementations must ignore any text outside the header/trailer blocks.

All keyword lines must begin with an escape character ('='), followed by an
ASCII 79h ('y').  This '=y' combination uniquely identifies a line as a
keyword line, since 'y' is not a valid encoded critical character.

Header and trailer keyword lines always begin with an escape character,
followed by a keyword indicating the line type, followed by any keywords
appropriate for that particular line type.

A typical header line should look similar to this:


=ybegin line=128 size=123456 name=mybinary.dat


>(1.2) Future versions of yEnc (if any) might use a different keyword
> than =ybegin. Perhaps "=ybegin2". Decoders should scan for "=ybegin "
> - with a SPACE behind =ybegin.

>(1.2) If the parameters "line=" "size=" "name=" are not present then
>the =ybegin might be part of a text-message with a discussion about
>yEnc. In such cases the decoder should assume that there is no binary. 


Header lines must always begin with the "ybegin" keyword, and contain the
typical line length, the size of the original unencoded binary (in bytes),
and the name of the original binary file.

The filename must always be the last item on the header line.  This ensures
that all characters and character sequences may be included in the filename
without interfering with other keywords.  Although quotes (ASCII 22h, '"')
are technically permitted, they are not recommended for use in filenames.
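
A header line could be parsed roughly as follows.  This is a sketch, not
a reference parser: the function name parse_ybegin and the choice to
return a dictionary of strings (or None for lines that do not look like
a real header) are assumptions of this example.

```python
def parse_ybegin(line: str):
    """Return the header keywords as a dict, or None if the line does not
    look like a real yEnc header."""
    if not line.startswith("=ybegin "):  # note the trailing SPACE
        return None
    rest = line[len("=ybegin "):]
    # The name is always last, so everything after " name=" belongs to it.
    head, found, name = (" " + rest).partition(" name=")
    if not found:
        return None
    fields = {"name": name.strip()}
    for token in head.split():
        key, equals, value = token.partition("=")
        if equals:
            fields[key] = value
    if "line" not in fields or "size" not in fields:
        return None  # probably plain text that merely mentions =ybegin
    return fields
```

Splitting at the first " name=" keeps filenames that contain spaces or
"=" characters intact, because nothing after that point is interpreted
as a keyword.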

> (1.2): Leading and trailing spaces will be cut by decoders!
> (1.2): See additional experience information
> Implementors of decoders should be careful about the filename.
> It can contain non-US-ASCII-characters (80h-FFh), control-characters
> (01h..1Fh), and characters which conflict with the current platform:
> / \  < | > : ? * @  
> It can be a very long parameter (up to 256 characters).
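
A decoder might clean up a received filename before using it, for
example as sketched here.  The function name safe_filename, the "_"
replacement character, and the "unnamed" fallback are all assumptions of
this example; characters in the range 80h-FFh are left untouched.

```python
def safe_filename(name: str, max_length: int = 256) -> str:
    """Replace characters that are risky on common platforms with "_"."""
    forbidden = set('/\\<|>:?*@')
    cleaned = "".join(
        "_" if (ch in forbidden or ord(ch) < 0x20) else ch
        for ch in name.strip()
    )
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "unnamed"
```

Which characters are actually problematic depends on the platform, so a
real decoder may want a stricter or a more lenient list.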
 

A typical trailer line should look similar to this:


=yend size=123456
		

Trailer lines must always begin with the "yend" keyword, and must contain
the size of the original unencoded binary (in bytes).

The size of the original binary must be repeated in the trailer for
redundancy checking.  yDecoder implementations should compare the header
size value with both the trailer size value and the actual size of the
resulting decoded binary.  If any of these three values differ then the
attachment is corrupt, and a warning must be issued; the resulting decoded
binary must be discarded.  (1.2) See additional experience information
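
For a single-part binary, the three-way comparison can be written very
compactly.  The function name below is hypothetical.

```python
def size_is_consistent(header_size: int, trailer_size: int,
                       decoded: bytes) -> bool:
    """True only if the =ybegin size, the =yend size, and the actual
    length of the decoded binary all agree."""
    return header_size == trailer_size == len(decoded)
```

A decoder would issue its warning whenever this check returns False.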


Verifying Integrity
---------------------------------------------------------------------------
yEncoded documents may also include a 32-bit Cyclic Redundancy Check (CRC)
value, to assist in verifying the integrity of the encoded binary data.

A CRC32 value, if present, should be included as a "crc32" keyword in the
trailer line.  Such a trailer line might look similar to this:


=yend size=123456 crc32=abcdef12
		

It should be noted that CRC32 values are not mandatory, but should, if
possible, be processed if present.  
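
A CRC check might be sketched as follows.  This document does not spell
out the CRC variant, so the example assumes the common CRC-32 computed
by Python's zlib.crc32, calculated over the decoded binary and written
as eight hexadecimal digits; the function names are illustrative.

```python
import zlib


def format_crc32(decoded: bytes) -> str:
    """Format the CRC as eight lowercase hex digits, as in the sample."""
    return f"{zlib.crc32(decoded) & 0xFFFFFFFF:08x}"


def crc32_matches(decoded: bytes, keyword_value: str) -> bool:
    """Compare decoded data with the hex value of a crc32= keyword."""
    expected = int(keyword_value, 16)  # accepts upper and lower case
    return (zlib.crc32(decoded) & 0xFFFFFFFF) == expected
```

Comparing integers rather than strings avoids false mismatches caused by
upper-case digits or missing leading zeros.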

>(1.2) See additional experience information

Sample yEncoded File Part
---------------------------------------------------------------------------
The following is an excerpt from an actual yEncoded file block:


=ybegin line=128 size=111401 name=al_larsonbw030_ball.jpg 
)_)=J*:tpsp*++++V+V**)_*m*0./0/.00/011024:44334>896:A>....
....
....
....R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴
´R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴R̴Rͩ)_
=yend size=111401 


Complete yEncoded file samples are also available at www.yenc.org.  


Multi-part Encoded Binaries
---------------------------------------------------------------------------
It is frequently desirable to split large binary files into multiple parts
for transmission over the Internet.  Such binaries are often rendered
unusable by missing parts and/or data corruption.

To address these problems, yEncode defines an additional keyword line,
"ypart", and several additional keywords to handle multipart binaries.

Each individual file part begins with a standard "ybegin" header line, but
an additional keyword, "part", is added to specify the part number and
identify the file as a multipart binary.

When the "part" keyword is included in a header line, the following line
must be a "ypart" keyword line which specifies information about the part. 
The "ypart" keyword line requires a "begin" and "end" keyword; these
specify the starting and ending points, in bytes, of the block in the
original file.

The file part must end with a slightly modified "yend" trailer line.  An
additional keyword, "part", is added to specify the part number.  This part
number must match the part number found in the header line.

> (1.2) An additional keyword "total" should also be added.
> This total number must match the total number of parts found in the header
> line. The first implementations of yEnc do NOT include this parameter.

The trailer line must also contain a "pcrc32" keyword representing the
CRC32 of the preceding encoded part.  As always, it is also desirable (but
not required) to include a "crc32" keyword representing the CRC32 of the
entire encoded binary.

Unlike single-part yEncoded documents, the "size" keyword in the trailer
lines of multipart encoded binaries must represent the size of the file
part, not the size of the entire file.  To verify integrity, a decoder
implementation must recompute the expected part size from the "begin" and
"end" keyword values in the "ypart" line.  If the expected part size
differs from the part size specified in the "yend" line, the file is
corrupt.
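
The expected part size follows directly from the "begin" and "end"
values.  The sketch below assumes that both values are inclusive and
count from 1, as the sample in this section suggests (begin=1 end=100000
for a part of 100000 bytes); the function names are hypothetical.

```python
def expected_part_size(begin: int, end: int) -> int:
    """Size of a part computed from its =ypart begin/end values."""
    return end - begin + 1


def part_size_ok(begin: int, end: int, yend_size: int) -> bool:
    """True if the =yend size agrees with the =ypart range."""
    return expected_part_size(begin, end) == yend_size
```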

A sample multipart encoded binary might look similar to this:


> (1.1) =ybegin part=1          line=128 size=500000 name=mybinary.dat
> (1.2) =ybegin part=1 total=10 line=128 size=500000 name=mybinary.dat
=ypart begin=1 end=100000
.... data
=yend size=100000 part=1 pcrc32=abcdef12 


=ybegin part=10 line=128 size=500000 name=mybinary.dat
=ypart begin=400001 end=500000 
.... data
=yend size=100000 part=10 pcrc32=12a45c78 crc32=abcdef12 
		

It should be noted that if a decoder does not implement multipart support,
or fails to detect a multipart encoded binary, then it will not
successfully decode the individual file parts because the "size" keyword in
the "ybegin" line will differ from the "size" keyword in the "yend" line.

Multipart binaries are usually quite sensitive to corruption.  Transferring
hundreds of megabytes in vain, simply because a corrupt part cannot be
identified, is a significant waste of bandwidth.

Using the "begin" and "end" keywords, yEncode allows decoders to identify
the position of an individual part in a larger file, which allows parts to
be combined from several different sources regardless of the part size. 
This feature is unique to yEncode, and is very easy to include in an
encoder implementation.  
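
On the decoder side, the "begin" and "end" values let each decoded part
be written straight to its place in the final file.  The following is a
sketch under the assumption that the whole file fits in memory and that
offsets are 1-based and inclusive; the function name is hypothetical.

```python
def place_part(target: bytearray, begin: int, end: int,
               decoded: bytes) -> None:
    """Copy one decoded part into its position in the full file buffer."""
    if len(decoded) != end - begin + 1:
        raise ValueError("decoded part does not match begin/end range")
    if begin < 1 or end > len(target):
        raise ValueError("part lies outside the file")
    # begin/end are 1-based and inclusive; Python slices are 0-based.
    target[begin - 1:end] = decoded
```

The buffer would be created with the size from the "ybegin" line, for
example bytearray(500000), and parts can then arrive in any order and
from any source.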


Subject Line Conventions
---------------------------------------------------------------------------
Standard single-part yEncoded binaries require no special conventions for
the subject line.  It is recommended, however, that yEncoded binaries be
specifically identified as such, until the yEncode encoding format becomes
more widely implemented.

The suggested format for subject lines for single-part binaries is:

[Comment1] "filename" 12345 yEnc bytes [Comment2]

[Comment1] and [Comment2] are optional.  The filename should always be
enclosed in quotes; this allows for easy detection, even when the filename 
includes spaces or other special characters.  The word "yEnc" should be
placed in between the file size and the word "bytes".
> (1.2) see additional experience information
> Placing the word "yEnc" between filename+bytes or bytes+comment2
> is acceptable.
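
A posting tool might build the suggested single-part subject line like
this.  The function name and parameter names are made up for this
sketch.

```python
def single_part_subject(filename: str, size: int,
                        comment1: str = "", comment2: str = "") -> str:
    """Build: [Comment1] "filename" 12345 yEnc bytes [Comment2]"""
    fields = [comment1, f'"{filename}"', str(size), "yEnc", "bytes",
              comment2]
    # Empty comments are dropped so that no double spaces appear.
    return " ".join(field for field in fields if field)
```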

Multi-part archives should always be identified as such.  As with
single-part binaries, they should also be identified as yEncoded until
yEncoding becomes more mainstream.

The (strongly) recommended format for subject lines for multi-part binaries
is:

[Comment1] "filename" yEnc (partnum/numparts) [size] [Comment2]

Again, [Comment1] and [Comment2] are optional.  The [size] value is also
optional here.  The filename must be included, in quotes.  The keyword
"yEnc" is mandatory, and must appear between the filename and the size (or
Comment2, if size is omitted).  Future revisions of the draft may specify
additional information to be inserted between the "yEnc" keyword and the
opening parenthesis of the part number.  
> (1.2) see additional experience information
> Placing the word "yEnc" between (#/#)+size or size+comment2
> is acceptable.
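
A newsreader could recognize the recommended multi-part format with a
regular expression such as the one below.  This sketch only handles the
strongly recommended layout (filename, "yEnc", part numbers) and not the
alternative placements mentioned in the 1.2 note; the names are
illustrative.

```python
import re

# "filename" yEnc (partnum/numparts)
SUBJECT_RE = re.compile(
    r'"(?P<name>[^"]+)"\s+yEnc\s+\((?P<part>\d+)/(?P<total>\d+)\)'
)


def parse_multipart_subject(subject: str):
    """Return (filename, part, total), or None if the subject does not
    follow the recommended multi-part format."""
    match = SUBJECT_RE.search(subject)
    if match is None:
        return None
    return match["name"], int(match["part"]), int(match["total"])
```

Because the filename is quoted, spaces inside it do not confuse the
pattern.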


>(1.2) Handling of corrupt messages
>(1.2) -------------------------------------------------------------------

Decoders should use error-detection whenever possible.
The user should be notified about corrupt messages.
If warnings are disabled then it is strongly recommended to store
binaries with an error-text in the filename. Examples:

picture(size-error).jpg
homemovie(crc32-error).avi
document(line-error).rtf
longmusic(missing-parts).mp3

It is also acceptable to store corrupt binaries
(some might even be partially usable).
But it is _not_ acceptable to hide detected errors from the user entirely.
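
The naming scheme shown in the examples above can be produced with a
small helper.  The function name is hypothetical, and the error texts
are simply the ones used in the examples.

```python
import os


def tag_filename(filename: str, error: str) -> str:
    """Insert an error text in parentheses before the file extension."""
    stem, extension = os.path.splitext(filename)
    return f"{stem}({error}){extension}"
```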

yEnc has the design target to _detect_ corruption.
Advanced newsreaders might even fetch replacements for corrupt messages
from other sources.



Protection and Copyright
---------------------------------------------------------------------------
The yEncode encoding method is released into the public domain.  Everyone
is permitted to copy it, to use it, and to implement it.

Neither this document nor the yEncode encoding method may be patented,
protected, or restricted in any way.  Everyone should benefit from it, and
its predecessors.

This document may be freely distributed, as long as credit remains with the
original author(s).  Do not claim that it's your own work!

Public domain example software is also available at www.yenc.org.  


Credits
---------------------------------------------------------------------------
This document has been created based on my [Juergen Helbing] own personal
experience, and help and input from a few Usenet activists.  Thanks to:

Jeremy Nixon
Curt Welch
Ed
Andrew
Stuart
JBerg
Marco d'Itri
The Meowbot
Jan Ingvoldstat
The UseFor taskforce
(others - please remind me!)
....

Draft revised (02/17/02) by Steve Blinch 
Draft extended (02/28/02) by Juergen Helbing


Conclusion
---------------------------------------------------------------------------
This is an informal proposal, not an RFC.  Your input is greatly
appreciated.  The author is just a poor programmer - with a few years of
binary experience.

Thanks for reading.

Juergen Helbing (yenc@infostar.de) 



----------------------- 
Changes from 1.1 -> 1.2
-----------------------

The "total=" parameter has been added to =ybegin

TAB is no longer a critical character
No. of critical characters is now 4 (old: 5)

Leading TABs & SPACEs, Trailing TABs and SPACEs and leading DOTs 
may be encoded as critical characters.

Additional hints for filenames
Additional hints for corruption detected by size-value
Additional hints for position of "yEnc" keyword
Additional hints for line sizes

Scanning for the keyword =ybegin should scan for "=ybegin " with
a SPACE at the end - for avoiding conflicts with successor versions
of yEnc "=ybegin2 ".
Missing parameters behind =ybegin  

Handling of corrupt messages.
Mailbox changed


Changes from 1.2 -> 1.3
-----------------------
"the proceeding character" --> "the following character" (N.R.)
"modulo 255" -> "modulo 256"  (J.H.)
"should be encoded" -> "may be encoded" (J.B.)