1234567890123456789012345678901234567890123456789012345678901234567890

yEncode - Experiences - 27.Feb 2002
-----------------------------------

This is the current status of yEncoding after several weeks of use
on Usenet and a lot of implementations.
I want to thank everybody for their efforts and enthusiasm.

Please keep in mind that the proposals for yEnc-V2 are not implemented
in any tool - and it is possible that there will never be a yEnc-V2 !

Implementors who have already written an encoder/decoder may want to
implement the "total=" parameter and check their policy for filenames.
This is not mandatory, but recommended.

After all, the amount of "forgotten" things is nearly null.


Forte-Agent
-----------

Some subject line formats prevent Forte (free) Agent from joining
yEncoded multiparts automatically.
Forte will support yEnc soon - but meanwhile here is the fix:
I was informed that this switch must be changed in the AGENT.INI file:

  old: RequireFilenameWithTag=1
  new: RequireFilenameWithTag=0


Forgotten things: total=#
-------------------------

yEnc draft 1 does not specify the total number of split parts
within the =ybegin line.
It might be possible to get that value from the subject header
(which is not possible for an external decoder when the headers have
already been removed).
It might also be possible to calculate that value from the size of
the first part (=ypart: begin= and end= info).
But this is not a very nice method either.

It is now recommended to add a NEW parameter to the =ybegin line for
multiparts: the "total=#" parameter, which indicates the total
number of parts.

Example:
=ybegin part=1 total=15 line=128 size=500000 name=mybinary.dat

Writers of new encoders are earnestly asked to add this parameter for
multiparts. However, a decoder cannot rely on its presence for yEnc-V1!


Forgotten things: filename-parameter
------------------------------------

yEnc does not specify the content of the "name=" parameter.
And there are a lot of problems if non-ASCII or Unicode is used.
This is discussed later. But here is the important topic:

If the "name=" parameter contains leading or trailing spaces then
a decoder should CUT THEM. An encoder should not use leading or
trailing spaces in that parameter.

If the "name=" parameter is enclosed in quotes (name="filename.jpg")
then the quotes should be removed by the decoder.

Of course most newsreader programmers handle this already.


Critical characters
-------------------

Practical experience shows that the TAB character can be omitted
from the list of critical characters in Usenet.

For eMail, stored files (newsreader behavior) and in other cases it
might be necessary to treat SPACEs and TABs as 'critical' characters
if they appear at the start or at the end of a line.
A careful writer of encoders should encode these two characters
whenever they appear in the first or last position of a line.

Generally a decoder must handle ALL escaped characters.
It cannot make a plausibility check on these characters - because
later versions of yEnc might add more critical characters.

It happens that a programmer writes an encoder and sends the
encoded file _directly_ to a TCP output stream. In such cases he
_must_ take care of the DOT in the first column!
A dot in the first column must be _doubled_ in any case.
Usually this is done by the transport layer. But if not, then the
programmer can also treat the DOT as a critical character if it
appears in the first column of a line. It must be doubled anyway -
so there is no bandwidth loss.

The same thing happens on the DECODER side whenever the data is
read directly from a TCP input stream. In this case a DOT in the
first column is always followed by a second dot in the second column.
The first dot is then skipped by the decoder.


Line Length
-----------

The 'default' line length (line=128) did not cause any problems.
So it can be used as the usual value.
Special applications might also use line=64.
This permits viewing a yEncoded message on an 80 column display.
It is not recommended to use a smaller value.

The recommended maximum line value is 254.
Some Pascal implementations could have problems with longer lines.

The maximum permitted line value is 997 for NNTP/SMTP applications.
With an escaped character at the end of the line and the trailing
CRLF, the maximum permitted line length for these protocols is then
reached.

Decoders must check the line value. They might refuse to decode if
they cannot handle a line value - and they should be prepared for a
value of 456789012343 !


Subject Line
------------

Most implementations of a yEnc encoder (posting through a newsreader
or autoposter) did not respect the subject conventions of the yEnc
draft. This is not too bad. But some posting programs are creating
confusion among the netizens - and in some newsreaders (or binary
downloaders).

It is acceptable to move the part indicator for a multipart message
directly behind the filename:

Old:  [c1] - "filename" yEnc (#/#) [size] [c2]
New:  [c1] - "filename" (#/#) yEnc [size] [c2]

It is _strongly_ recommended to add the keyword yEnc to all yEncoded
posts. Tools which don't add that keyword rely on the netizens to do
this - and they sometimes fail.

Some programs permit the user to add a second (#/#) pair of values
in round brackets separated by a slash. This is really sick.
Please prevent your users from adding such things even as _comments_!
For single parts and for multiparts !

Some programs use SQUARE brackets [#/#] as a part indicator
instead of round brackets. This is really confusing to the users !
Please don't do this for yEncoded multipart messages !

Some programs use SQUARE brackets [#/#] as an indicator for the
number of posted files (in one run). This is _fine_.
But please keep in mind that such counters must be used _only_ in
front of the filename - in the [c1] section.

Filenames are getting longer and longer. There are a lot of spaces
in them today.
PLEASE use quotes for the filename in the subject line.
This permits users to see what is coming.
And there is no really good reason why users should use quotes
either in filenames or in comment lines.


Former "nice ideas"
-------------------

Adding the CRC32 of a posted binary to the subject line is a bad idea.

Setting the "Lines:" header to a faked value to achieve similarity to
UUencoded posts is also a bad idea. The "Lines" value from the
XOVER overview is usually recalculated by the news server which
receives the message by POST.

Adding additional database information seems to be no topic (yet).
So there is no further development of the =ydata lines.
If a decoder finds =y*** lines (outside the =ybegin/=yend block)
then it should simply skip them.


Size of split messages
----------------------

There has been confusion among the users of yEnc posting programs
about the real size of a message and the distribution quality
(which is bad for ultralong posts).

UUencode usually encodes 45 source bytes into one line -
which results in 60 bytes + CRLF for the upstream.
yEnc typically encodes 128 source bytes into one line -
which results in 133 bytes (on average).

If a message was formerly posted with 10.000 lines per section (UU)
(620 kBytes) then it is now posted with ~ 4.500 lines to have
the same message size on Usenet!

The other 'usual' values are:

UUe    -  yEnc         msg-size
15000  -  6900 lines   930 kB
10000  -  4500 lines   620 kB
 7500  -  3400 lines   465 kB
 5000  -  2300 lines   310 kB
 3000  -  1300 lines   186 kB

Writers of AutoPosters should offer proper information about the
real size they are posting.
This confusion will end as soon as the "lines" value becomes less
important. It should generally be replaced by the message size.


Filenames: length and character-sets
------------------------------------

It is possible that a filename gets very long (up to 255 chars).
This is not too bad - as Usenet should be able to transport lines
up to 1000 bytes - and the filename is not critical.
Adding a separate line (=yname=) is no real option for me, because
this would again limit the size to 248 characters.
Decoder programmers should take CARE of their input buffer whenever
they read the =ybegin line !

The concern was raised that filenames could also contain NON-ASCII
characters (ISO*, Unicode, ...). This then also happens on the
subject line. I have no real solution to this problem.



Proposals for extensions of yEnc
================================

A lot of people have _great_ ideas for extending yEnc.
And I want to thank all of them for their enthusiasm.
We had to postpone all these wishes to yEnc-V2.
Here is an (incomplete) list of them:


Version Number
--------------

If we release a completely new and different version of yEnc, then
existing decoders will have problems with it. So we need a new
keyword for the detection of yEncoded messages. The proposal would be:

=ybegin2

If you are writing a yEnc decoder then please scan the source data
for: "=ybegin " - with a SPACE after the keyword.
If you are scanning for "=ybegin" then you should check whether the
following character is a space - or a digit (the version number).

A better proposal would be:
=y2begin


PAR - files
-----------

These parity volumes are usually posted together with a bunch of
binaries for repairing corrupt messages - and for restoring
missing parts from the redundancy/difference information in the
PAR files.
Adding them to yEnc is currently not in sight. The complexity would
grow enormously. And most professionals believe this should be kept
at user level.
I personally see no reason why a newsreader should not create them
automatically - and post them together with the source files in one
run.


Compression
-----------

yEncoded files can grow large if extreme data is encoded.
If only "critical" characters are included, then the encoded file
might have twice the size of the original.
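
The doubling can be checked with a minimal sketch of the yEnc-V1
byte encoding. It is an illustration only, not a complete encoder:
it assumes the critical set NUL, LF, CR and '=' from the yEnc draft,
does no line splitting, and writes no =ybegin/=yend lines. The names
yenc_encode, worst and ordinary are made up for this example.

```python
OFFSET = 42         # added to every source byte
ESCAPE = ord("=")   # escape character (61)
ESC_OFFSET = 64     # added to a byte that had to be escaped

# Assumption: critical characters as listed in the yEnc draft
# (NUL, LF, CR, '='); TAB is left out as discussed above.
CRITICAL = {0x00, 0x0A, 0x0D, ESCAPE}


def yenc_encode(data: bytes) -> bytes:
    """Encode raw bytes; no line breaks, no header/trailer lines."""
    out = bytearray()
    for b in data:
        o = (b + OFFSET) % 256
        if o in CRITICAL:
            # Escaped byte: '=' followed by the byte shifted again.
            out.append(ESCAPE)
            o = (o + ESC_OFFSET) % 256
        out.append(o)
    return bytes(out)


# Worst case: every source byte maps onto a critical character.
# 214 + 42 = 256 -> 0 (NUL), so each byte becomes two output bytes.
worst = bytes([214]) * 1000

# Ordinary case: every byte value once, only 4 of 256 are escaped.
ordinary = bytes(range(256))

print(len(yenc_encode(worst)) / len(worst))        # 2.0
print(len(yenc_encode(ordinary)) / len(ordinary))  # 1.015625
```

For evenly distributed data the overhead from escaping stays below
2 percent, which is why the worst case only matters for unusual input.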
Several approaches have been proposed to solve this problem:

* Using a variable offset (instead of the '42').
* Using a variable (or different) escape offset (instead of the '64').
* Using a different or variable escape character (instead of '=').

All these options could be used in yEnc-V2. Example:
=ybegin2 offset=42 escape=61 escoff=64 ... line=128
It might be possible to avoid extreme cases - but I'm not sure
about their importance.
Flexibility with the escape character would in any case cause
problems with the =yend line.

Some people are proposing RLE compression (run length encoding)
to avoid long sequences of critical characters (which would also
blow up the encoded result). As this would also ignore double or
triple bytes of critical characters, there seems to be no general
solution other than to use a "general compression".

Some people favor a "good standard compression" such as BZIP.
It is not yet clear which one to use - because it must be public
domain (also for commercial use), fast, and easy to implement on
various platforms.
The general approach would be a "compression parameter". Example:

=ybegin2 comp=zip

An encoder would use ZIP before the binary is encoded.
A decoder would use ZIP (to decompress) after the binary is decoded.
A decoder which does not understand "comp=guzip" could still
save the file with the extension "*.guzip" and try to call
an external application which handles the compression.


Multiple binaries in one message
--------------------------------

If multiple binaries are stored within one file then the subject line
cannot contain all of them. However, it makes sense to use at least
the FIRST filename in the subject line - rather than sending nothing.
If (for example) an HTML file is encoded together with all its
included pictures, then the name of the HTML file should be displayed
in the subject line.


Multiple binaries in split messages
-----------------------------------

Some people are posting 20 pictures in a 10 or 40 part multipart
message.
Better said: some news software permits posting such things.
Besides the fact that this method hides filenames from the
readership (and is therefore mainly used by spammers, trolls or
newbies), your news tool should _reject_ the attempt to post this
way at all.
Instead, smaller files should be placed into one message - and larger
files should be split.

Please keep in mind:
Missing a picture from a series is not too bad.
But missing one part of a multipart which is required to decode it
at all, before you can see what you get, is annoying.


Constant message size
---------------------

Some implementors of yEncoders wanted to have a constant message size
for multiparts. They wanted to stop encoding when a particular size
of the message is reached. But they have problems determining the
number of source bytes, as they cannot predict the result size after
encoding.
There was the idea to move the "end=" parameter to the "=yend" line -
instead of having it in the =ypart line.
(I don't like this idea - because then all decoders would have to
seek for =yend first).
Well - the same result could be reached by encoding to a large
memory buffer - then writing the =ypart line - and then writing the
buffer. (This would also avoid reading a file twice).

Generally it is _highly_ recommended to encode multiparts with
a fixed number of SOURCE bytes. The deviation of the yEnc overhead
is not as drastic as it seems.

However, the decoder should also be prepared to receive yEncoded
multiparts with FLOATING sizes.
The real size of a part can be determined from the
=ypart begin=# end=#
And the last part would still have end=# identical to size=#


Constant amount of lines
------------------------

Some implementors of yEncoders wanted to have a constant number of
lines. The floating number of lines could really disturb users.

There are two solutions:

Either these implementors also (as before) stop encoding when a
specific number of lines is reached. The same procedure as for
"constant message size" applies.
Or we would let the LINE size float - the length of a line.
Then line=128 would specify 128 SOURCE bytes instead of 128 ENCODED
bytes. Of course this would require a fundamental change in the
encoding - but could still be used in yEnc-V2.

I don't believe that this topic is too important, but it should
also be mentioned here to prevent the same questions.


Separate parameter list - or combined parameter list
----------------------------------------------------

Some people suggest dropping the =ypart line and instead placing
everything into the =ybegin line.
Other people suggest having separate lines for every parameter.
- no comment, they are all right.


MD5-checksum instead of CRC checksum
------------------------------------

Some people propose to use the MD5 calculation for protecting
yEncoded messages from corruption instead of CRC32.
CRC32 was originally selected because this value also appears in
SFV/CSV files and is used (sometimes shown) by compressors.
I am not sure if CRC32 is too weak for the purpose we have.



More practical experience - what can happen if....
==================================================

Calculation of the CRC32 on 64 bit computers
--------------------------------------------

Implementors for 64 bit CPUs (and 16 bit CPUs) should be aware of
the fact that the source code examples are written for a 32 bit
machine. The calculation of the CRC32 is _sensitive_ to the size of
the variables. Please find appropriate source code for your platform -
or use the 'usual' #define / typedef methods to guarantee correct
CRC calculation.


Wrong CRC32 detection
---------------------

Nothing is so easy that it cannot be implemented wrongly. (Mc Murphy).
Implementors seem to have bigger problems implementing CRC32 for
encoders or decoders. Someone told me "there are more intact messages
with a wrong CRC32 on Usenet than corrupt messages".
The result is/was even the wish to switch OFF the CRC32 detection
to permit users to store files which would otherwise be rejected as
corrupt. I believe that this is _generally_ a bad idea.
Implementors should _carefully_ test their encoders/decoders. There
is ENOUGH material in alt.binaries.test.yenc every day.
Tools which create false CRC32 information must be removed from the
net as fast as possible ! And if you cannot find the bug then it is
better to send without the CRC32 (which is not mandatory) rather than
with a false one.

It might be possible to offer the user an option to store a binary
_even_ if it is corrupt. And even then the reason for the
corruption should be added to the filename!
It might be possible to see a picture or listen to a voice file even
if it is corrupt - but it should not be stored as if nothing happened.

Recommendation:
If a CRC is wrong then store the file with this name:
  picture(crc-12345678).jpg
If a size is wrong then store the file with this name:
  movie(size-123456789).avi


Single parts posted in the multipart format
-------------------------------------------

Some implementors want to always add the (1/1) to a single part
binary. All I say here is: Why not :-)

Some implementors also want to send single part binaries in the full
multipart format:

=ybegin part=1 total=1 line=128 size=123456 name=binary.dat
=ypart begin=1 end=123456
....
=yend

I personally believe this is weird - but implementors of decoders
should be prepared for this case.


Last multipart is empty
-----------------------

Someone created an encoder which created empty last parts.
My only comment is: "Shit happens".
Be prepared to receive such things if you write a decoder.


[EOF]

--
Last Changes:  =ybegin2  -->  =y2begin