yEnc - Origin and background

Looking at yEnc, the suspicion arises that the whole thing was put together on the fly, and in a rather unorganised way. Those not acquainted with its development history can easily get the impression of some hasty patchwork.

In this document I will explain why yEnc was developed in the first place - and why it looks the way it does. Perhaps a closer look will explain a lot. In any case, be assured that yEnc is not a quick shot in the dark - it's based on the experiences of many people, and on years of the author's own experience in binary newsgroups.

** The first steps

Binary data of any kind is encoded at considerable expense for posting on Usenet: for every 100 kB of data, roughly 140 kB ends up inside a message. The reason for this is the UUencode format, the most widespread format on Usenet - it makes 62 bytes out of every 45. This is a terrible waste of resources, and one may wonder why it hasn't been replaced by something better long ago.
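The figures above can be checked with a line of arithmetic: per the text, UUencode packs 45 input bytes into a 60-character payload, plus a length character and a line break.

```python
# UUencode line economics, per the figures in the text:
# 45 raw bytes become a 60-character payload plus a length
# character and a newline - 62 bytes on the wire.
raw_bytes = 45
wire_bytes = 60 + 1 + 1
ratio = wire_bytes / raw_bytes
print(f"{ratio:.3f}")                      # expansion factor
print(f"{100 * ratio:.1f} kB per 100 kB")  # i.e. roughly 140 kB per 100 kB
```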

As other protocols are able to send binary data directly, the question arises why this isn't implemented in NNTP. But since serious changes to message formats and/or the NNTP protocol would be necessary, this is not an option. (Both taskforces disapprove.)

An alternative is encoding the data with an extended character set, in which only the really problematic characters aren't used. Several experienced administrators and programmers believe that Usenet is actually capable of transporting all characters without loss, except for NULL, CR and LF. They encourage anyone to test whether this is wholly true.

A possible first step would be to use QP-lite (8-bit quoted-printable), encoding only these three characters. But QP-lite had not been specified at the time - it was only being talked about. Also, several problems would arise: a new encoding method can't just be fitted into the existing standards, and the behaviour of existing clients would be unclear.

On top of that, NULL is a very common character. Its encoding would produce a relatively large overhead.

** The fundamental encoding procedure

A quick look at some "obvious" binaries (jpeg, mpeg, avi, zip) shows that several codes (00, 01, FF, FE, E0, 7F, 80) occur relatively often. With a "simple" encoding mechanism, there's a risk of overhead that gets completely out of proportion. The idea of a complete analysis of the source data was abandoned too, because of its extreme complexity and time-consuming nature - after all, we're talking about 500 MB video files here.

Rotating the character codes is an elegant solution: adding 42 to each byte moves the frequently occurring critical codes out of the escape zone. (Other offsets would have the same effect.)
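The rotation can be sketched in a few lines. This is a minimal illustration of the principle - offset 42, with an `=` escape for the few characters that remain critical - not a complete yEnc implementation; line wrapping and headers are left out:

```python
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}  # NUL, LF, CR and '=' (the escape character)

def yenc_encode(data: bytes) -> bytes:
    """Core yEnc transformation (sketch, no line wrapping)."""
    out = bytearray()
    for b in data:
        c = (b + 42) % 256          # rotate: frequent codes like 00/FF leave the escape zone
        if c in CRITICAL:
            out.append(0x3D)        # '=' escape marker
            c = (c + 64) % 256      # shift the escaped byte out of the critical set
        out.append(c)
    return bytes(out)

def yenc_decode(enc: bytes) -> bytes:
    """Inverse of yenc_encode."""
    out = bytearray()
    it = iter(enc)
    for c in it:
        if c == 0x3D:               # escape: the next byte was shifted by 64
            c = (next(it) - 64) % 256
        out.append((c - 42) % 256)
    return bytes(out)
```

A round trip over all 256 byte values restores the input exactly, and the encoded stream contains no NUL, CR or LF.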

** Line sizes

NNTP servers have their limitations when it comes to line lengths, so encoded data has to be cut into reasonable chunks by CRLFs. Asking around reveals that people prefer line lengths ranging from 60, via 256, up to 999 characters. Of course, lines should be as long as possible, as more line breaks mean more overhead. To stay flexible, the decision should be left to the encoding software - so the need arises to incorporate line size information in the format. That way, it will one day be possible to use 100,000-character lines - and clients unable to cope with them can simply refuse them.
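A wrapping step could look like this - a sketch of one plausible policy (keep escape pairs together); the yEnc draft itself pins down the exact rules:

```python
def wrap(encoded: bytes, line_len: int = 128) -> list:
    """Cut an encoded stream into lines of at most line_len bytes,
    never splitting a '=' escape pair across a line break.
    (One plausible policy; real encoders may differ in details.)"""
    lines, i = [], 0
    while i < len(encoded):
        end = min(i + line_len, len(encoded))
        # if the cut lands right after a lone '=', take the pair along
        if end < len(encoded) and encoded[end - 1] == 0x3D:
            end += 1
        lines.append(encoded[i:end])
        i = end
    return lines
```

Joining the lines back together yields the original encoded stream, so the decoder can simply strip the CRLFs.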

** Data integrity

As it's not clear if data encoded this way will be handled correctly by Usenet and connected services, integrity check measures should be taken. And as it will always occur that messages won't get through the long chains of Usenet servers without corruptions (for whatever reason), this is a nice opportunity to do something about that, too.

The solution is to incorporate the message and line sizes in the new format. For even better testing it's possible to add a CRC32 of the data.

That way, every link in the chain can be checked using the message and line sizes and the CRC32 - which even makes it possible to analyse problems. (After all, just "it's broken" isn't very helpful.)
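In Python, the size/CRC32 check described here might look like the following (zlib's `crc32` computes the same standard CRC-32 that yEnc uses; the dictionary shape is just for illustration):

```python
import zlib

def make_trailer(data: bytes) -> dict:
    """Integrity information a poster could attach: size plus CRC32."""
    return {"size": len(data),
            "crc32": f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"}

def verify(data: bytes, trailer: dict) -> bool:
    """Check received data against the attached size and CRC32."""
    return (len(data) == trailer["size"]
            and f"{zlib.crc32(data) & 0xFFFFFFFF:08x}" == trailer["crc32"])
```

A single flipped or missing byte makes `verify` fail, so every link in the chain can tell good data from bad.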

** Single message format

To properly transfer and decode the encoded data, it's necessary to fit it into a well-defined frame. As yEnc incorporates extra information for integrity checks, it needs its own header data. To avoid changes to message header formats - for whatever means of transport - the encoded blocks need their own header format. The way MIME does this is exemplary, but cannot be used for our purpose, as the header data of the message itself would be needed. Apart from that, MIME is very complex and not really popular with a lot of programmers (it's not just me) - I was definitely advised not to use it.

For those reasons, and others, a simplified format was chosen, modelled on the existing UUencode standard: the "begin" line was formalised to make it unique and to contain all the information necessary to decode the binary data correctly under all circumstances. That also works when no message header is available - or when the message is transported some other way (as long as it's 8-bit clean). It's even possible to embed data in HTML/XML like this!

This way, it's also possible to add text before or after the data, or put several data blocks in one message. This would even allow a complete HTML site with all extra data (pictures) to be sent in one message.
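Putting the pieces together, a single-part frame of this shape could be produced like so - keyword names follow the yEnc draft as I understand it, so treat the details as illustrative and check the specification:

```python
import zlib

def frame(name: str, data: bytes, encoded: bytes, line: int = 128) -> bytes:
    """Sketch of the single-part frame: a formalised begin line carrying
    everything a decoder needs (line length, size, file name), and an
    end line with size and CRC32 for verification."""
    head = f"=ybegin line={line} size={len(data)} name={name}\r\n".encode()
    tail = (f"\r\n=yend size={len(data)} "
            f"crc32={zlib.crc32(data) & 0xFFFFFFFF:08x}\r\n").encode()
    return head + encoded + tail
```

Because all the decoding parameters live between `=ybegin` and `=yend`, arbitrary text can surround the block without confusing a decoder.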

** Multipart, splitting and joining

Many binaries today are larger than Usenet servers can or will allow. That's why messages larger than 100/250/500 kB are usually split into smaller parts.

Unfortunately, this splitting is only very rudimentarily defined for UUencode - and MIME has its own peculiarities. Looking for a multipart format that all systems can handle, and that can also be used for other means of transport, it turns out to be best to encode complete blocks: the data is split into parts, each with its own verification data (lengths, CRC). Also, every part has its own begin (=ybegin) and end (=yend), so it's possible to add explanatory text before and after it (which Usenet servers don't like now) without endangering the data's integrity.
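The per-part bookkeeping can be sketched as follows - a conceptual illustration of "each part carries its own offsets and CRC", not the literal `=ypart` header syntax of the draft:

```python
import zlib

def split_parts(data: bytes, part_size: int) -> list:
    """Split a binary into parts, each with its own offsets and CRC32,
    so a corrupt part can be identified - and re-fetched - on its own."""
    parts = []
    for n, off in enumerate(range(0, len(data), part_size), start=1):
        chunk = data[off:off + part_size]
        parts.append({
            "part": n,
            "begin": off + 1,           # yEnc counts offsets from 1
            "end": off + len(chunk),
            "pcrc32": zlib.crc32(chunk) & 0xFFFFFFFF,
            "data": chunk,
        })
    return parts
```

Each entry is verifiable in isolation: comparing a part's `pcrc32` against a fresh CRC of its data tells you exactly which part is damaged.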

Furthermore, giving every single part its own integrity checks has an extra advantage: if one part of a 50-part message is corrupted, it's enough to download (or request) just that part. Until now, all methods in use required all 50 parts to be downloaded again (and often from the same source) at the slightest error. Because of yEnc's universal design, it's possible to get just the missing parts from whatever source and fill the holes with them.

An extreme scenario: one user has successfully downloaded 48 parts of a 50-part message. Other users have got only 47 parts - each with different parts missing. (Usenet likes to "forget" complete messages - for several reasons - another pressing problem.) Now if one user resends the complete message, yEnc makes it possible to complete the other copies using only the needed parts. Again, it'll also work if other means of sending are used - even with different parameters (like splitting size).

Of course, doing this it's necessary to ensure the original data's identity - which can be done by including global CRC32 and size data.

In the current version of yEnc this global check is not yet implemented - but a possible implementation has already been taken into account.

** Changing to yEnc, and existing tools

Introducing a new encoding method is senseless if it can't be used right from the start, while all the necessary tools still have to be examined or even altered. With the multitude of tools available, it's impossible to appoint a "milestone moment" at which to start using yEnc - that's why special care should be taken with this transition period.

Posting tools should of course be available - but a couple of elementary tools should be enough. Most important is the users' ability to receive and decode the messages. In the very beginning it's unavoidable to use external decoders for this - and those decoders should be able to check the data's integrity correctly.

All newsreaders can store messages. Most of them store the complete message, others only the contents. When storing multiple messages at once, separating characters or lines are sometimes put in. Or they're stored in UUCP bag format... Or... Or... The order in which the messages are stored also depends on the program and settings used.

When storing multipart-messages, often the message headers get lost, including the information about which piece of (corrupt?) data belongs in which message.

yEnc has virtually no problems with all those effects: it'll find the data blocks anyway. It just takes a bulk of data and extracts complete multiparts from it.
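Such a scavenging decoder only needs to scan for the begin/end markers; a minimal sketch (a real decoder would also parse the header keywords and verify each block):

```python
def extract_blocks(bulk: bytes) -> list:
    """Pull every =ybegin ... =yend block out of an arbitrary pile of
    saved messages, ignoring whatever surrounds them."""
    blocks, pos = [], 0
    while True:
        start = bulk.find(b"=ybegin ", pos)
        if start < 0:
            break                          # no more blocks
        end = bulk.find(b"=yend", start)
        if end < 0:
            break                          # truncated block: stop scanning
        stop = bulk.find(b"\n", end)       # take the whole =yend line
        stop = len(bulk) if stop < 0 else stop + 1
        blocks.append(bulk[start:stop])
        pos = stop
    return blocks
```

Headers, separators and stray text between the blocks are simply skipped, which is exactly what makes the format robust against the storage quirks listed above.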

** Ease of programming

When introducing a new method of encoding, programming examples are a necessity. These examples should be as simple as possible, so as not to scare off potential developers. Also, tools that until now didn't support MIME shouldn't be too hard to adapt: a "simple" extended UUdecoder is always easier to implement in an autoposter or newsreader than complete MIME functionality.

Without support from the international programming community, there is no new format. And these people are not "formalists" but "realists" - most of them are freeware/shareware programmers. yEnc won't get anywhere without them, so their wishes should be taken into account. Otherwise, the new format will never spread, and it'll be UUencode forever and ever.

** Adding improvements

After the first introduction of the yEnc concept lots of feedback has come from all possible sides. There are lots of wishes to expand the format, too:

Adaptive encoding (preventing overhead with extreme data), full compression, adding PAR information and embedding into MIME are the most important suggestions. As most of these wishes would involve really complex techniques, those who made them are gently asked for patience - let's hope for these improvements in future versions. Especially the possibility of automatic compression is tempting - but really hard to implement, too. Nevertheless, extensions like these should be accounted for. The wish for embedding into MIME is mentioned relatively often, so some introductory steps are described below.

** Embedding yEnc in MIME

As MIME is widely accepted as a standard format for messages, it should be possible to incorporate yEnc in it. Unfortunately, this is not as easy as it sounds. Opinions differ greatly on how to do it - but all proposals have in common that they contain lines like "is against this rule" - "won't work with that" - "causes that problem" - "don't know" - "is emphatically not recommended".

As there are MIME professionals, meeting in MIME newsgroups, they've been asked for help. Unfortunately these people belong to the "mail community" - they're not really in touch with the specific techniques. Even a simple matter like "what name do we give the data" poses trouble: a MIME decoder unable to understand yEnc will store the data under the given name. This will be the wrong name, as the data is still encoded, and should be stored under another name - otherwise the data can't be decoded into the same directory without overwriting itself. If, however, the encoder used file names like *.yenc, the MIME type "application/octet-stream" would have to be used to start an external decoder - which breaks with the concept of new MIME readers that _do_ know yEnc. The transition period would be simply catastrophic.

There are very good reasons why introducing new encoding methods is not recommended. Unfortunately, we'd be stuck with the overhead for the rest of eternity.

Of course, MIME people often object to the overhead that appears when complete yEnc blocks, from "=ybegin" to "=yend", are themselves encoded in MIME. They suggest simply leaving those lines out and using only the normalised MIME header. Still, no-one can explain how an external decoder should decode the message without a header. It would also be impossible to link corruptions to specific messages. The usual advice would then be: modify the newsreader. Not really helpful during a transition period.

After all this, one gets the idea of having taken a wrong turn, and starts to have a better look at the QP-lite method, which has now been defined. As it only functions with MIME, it should work without any problem. Unfortunately, existing tools (Agent, Outlook Express) have trouble processing multiparts: it's necessary to create malformed messages to decode a correct binary, and apart from that, CRLFs get put in where they don't belong. This method also doesn't cope with the "trailers" many news servers add to messages. Another wrong turn, it seems. A decoder whose behaviour is so unpredictable is hardly useful for a transition period or for later improvements. So, it's back to the concept of full embedding into MIME - for that method it doesn't matter what any tool comes up with.

Anyway, my question on how to protect _single_ parts of a multipart from corruption during transmission hasn't been answered by anybody yet. Content-MD5 only secures the entire binary - and its use with multiparts is simply prohibited.

What a MIME decoder should do when it encounters an incomplete binary is unclear. How it should fill the gaps with parts downloaded later is quite a puzzle. Inserting parts obtained in other ways is completely in the dark.

Still, I'm convinced there must be a way to realise all this. As it would require changing some MIME RFCs, or wilfully breaking generally accepted rules, it calls for a lot of patience and the willingness to take some risks.

Sending data on Usenet encoded any old way, in big text messages, is one thing - a sure method to make yourself unpopular. But breaking MIME rules - on which lots of tools depend - is quite another matter. It'll get you "killed" in no time.

My personal fear is that some MIME software somewhere can't cope with new transfer encodings - and then (because these program modules have never been tested) crashes. That would turn yEncoded MIME messages into a veritable DoS attack.

** Definition, alpha and beta tests

After all these experiences - and endless talks, research and lots of e-mails - I sit down and try to compile everything into a nice yEnc draft. Implementing and testing it right away is just one day's work.

The first introduction leads to the discovery of a couple of weaknesses, which results in some fine-tuning of the draft. A small team of three or four programmers discusses it for a couple more days and makes some final decisions.

With the development of encoding and decoding example software, the web site and an external decoder for Windows, yEnc is ready for the first launch. A newsgroup for testing is founded, and a couple of test postings helps to get all tools in tune. Larger test postings in normal binary groups bring up some last problems, which are solved. The people involved tell other programmers about the current events and invite them to implement yEnc.

** User acceptance

Usenet users seem explicitly not to care about the way their data is transmitted (MIME or xxxx). They enjoy the fact that it's faster - and that it copes better with corrupted data. (What they want next is good compression, to be able to skip the WinRAR, MsSplit or WinZip step.)

** Programmer acceptance

Some programmers of important Usenet software have contacted me, and more and more are doing so right now. It looks like yEnc will be accepted and integrated into widely spread freeware and shareware relatively soon. A reason for much rejoicing, of course - for these are the people it's all about. Apparently, programmers are being asked directly by users to offer yEnc support. And it's unlikely these programmers would put so much work into it if they thought my idea was utter nonsense.

** What happens next?

A Usenet guru has told me that in his opinion the only possible way of MIME implementation would be to encode the binary data into a new transfer encoding, and trust the regular MIME header for what it's worth.

Of all the things discussed, that would only solve the overhead problem - but it would be a small first step. As someone would have to go through all the trouble of designing an RFC, founding a taskforce, and discussing the matter with everyone involved, a lot of water will flow under the bridge before anything really starts to happen. In this case I can't be the instigator either, as I don't know a thing about the stuff. We must leave it to the real RFC and MIME professionals.

And what will then be implemented, and by which of them, is written in the stars.

The next steps for yEnc will be integrated compression (for which the available public domain software fails me) and securing multiparts with a global CRC. It should also become possible to improve the completeness of yEncoded message distribution (as it would be possible to prevent corrupted data from being sent on). And multihost newsreaders should learn to delete corrupt parts from messages and fetch them elsewhere.

But, from here it just goes steady on...


Jürgen Helbing, 06-12-2001

(Translation: Rob Prins)