XML: The scourge of the Internet

August 23, 2004

It's not that XML is a new thing, nor that it has only lately started to piss me off; it's just that XML has finally pissed me off enough for me to get off my lazy ass and write down just how much it is pissing me off.

A lot.

Here I am, minding my own business and trying to build out a new development server... you know, for development. And as all good developers know, you need to sometimes develop with software which has not quite matured yet. Saying nothing of the state of maturity of PHP as a whole (that's a different rant), I decided to download the 5.x series to "give it a go".

Aside from being the mistake that I thought it was going to be, it took the opportunity, and possibly some pleasure, in telling me that my system is not sufficiently advanced to run the default PHP because it doesn't have libxml2 2.5.10 or newer. This coming from a product whose build finishes as follows:

Build complete.
(It is safe to ignore warnings about tempnam and tmpnam).

But, again, this is about XML.

Some number of years ago, not content with creating the horror that was/is/forever will be HTML, the World Wide Web Consortium, a.k.a. the W3C, set out to try to make us forget their previous blunder by providing us with an even bigger fuck-up: XML.

Here's the first paragraph from the XML page at the W3C, annotated for your pleasure:

Extensible Markup Language (XML) is a simple¹, very flexible² text format³ derived from SGML (ISO 8879)⁴. Originally designed to meet the challenges of large-scale electronic publishing⁵, XML is also playing an increasingly important role⁶ in the exchange of a wide variety of data⁷ on the Web and elsewhere⁸.

1. Useless
2. Unusable
3. You should have just used an ASCII text file
4. We're gonna try to pin the blame on SGML
5. Pornography
6. As programmers become unable to program
7. Like configuration files
8. Actually, not really on the web that much

XML configuration files. For what? Portability? No. It's because programmers (mostly Java programmers, yet another rant) are either too lazy or completely lack the skills to write a text file parser. I guess that's what happens when you develop a code base around a language which depends more on programmers being able to put up with bullshit than on their demonstrating programming skills.

But it's a well-known, well-formatted configuration file

Number 1: It is not well known. You have to add your own 'extensions', which take twice as long for somebody to figure out because all of your variables and data are hidden in superfluous tags which are there more for the convenience of the parser than they are there to be useful. Number 2: Yes! It does have to be well formatted, 'cause guess what, you would have no chance at parsing it otherwise! Just like a regular text configuration file! Properly format a configuration file and you won't have any of these problems.

And then you have to bitch and moan to the user if the XML file isn't well-formed. GRRR! Can't I just use the data if I can figure out what it is? No, of course not, because just because the computer thinks it knows what's going on doesn't mean that the person who wrote the configuration file does. It's just way too easy to leave off a quote and just completely fuck the parser.

Here's an example:

MyButt sexy     <Bodypart id="MyButt" value="sexy"/>
MyChest huge    <Bodypart id="MyChest" value="huge"/>

Now think about this for a second. All the data in the left cell is in the right cell. Aside from precisely what is in the left cell, all the rest of the characters in the right cell are unnecessary. ALL of them.

So it's a couple of extra bytes. Who cares?

Well, me for starters. This is the thin end of the wedge. A couple of bytes in the case of 2 body parts expands to kilobytes and megabytes once you start getting a lot of entries. And I don't know about you, but I already download far more crap than useful information on the web as it is, I don't need somebody to pad the number.
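To put a number on the padding, here is a rough sketch that counts the bytes for one entry in each form. It assumes the XML form is a self-closing `<Bodypart id="..." value="..."/>` tag, which is what the XML parser further down expects; the function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Bytes for one flat-file entry: "id value\n". */
size_t flat_entry_size(const char *id, const char *value)
{
    return strlen(id) + 1 + strlen(value) + 1;
}

/* Bytes for the same entry as an XML tag.  The tag and attribute names
   are an assumption, taken from the XML parser in this article. */
size_t xml_entry_size(const char *id, const char *value)
{
    return strlen("<Bodypart id=\"") + strlen(id)
         + strlen("\" value=\"") + strlen(value)
         + strlen("\"/>\n");
}

/* Extra bytes the XML form costs over `entries` identical entries. */
size_t padding_bytes(const char *id, const char *value, size_t entries)
{
    return (xml_entry_size(id, value) - flat_entry_size(id, value)) * entries;
}
```

For the `MyButt` entry that works out to 12 bytes flat against 37 bytes of XML, so 25 bytes of padding per entry. At a hypothetical 100,000 entries that is about 2.5 MB of nothing.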

Now, think of the poor computer which has to parse this file over, and over, and over again. It's not enough that we torture the poor beast into doing stuff that it is not good at, i.e. anything but math; we also have to make it even harder with a text format which is as hard to parse as possible. Here's some C code for parsing each (untested):


 struct bodyparts {
     char *butt;
     char *chest;
 };

 struct bodyparts mybody;


 /* Read the file line by line, 'cause I made it that way.  Loop on
    fgets() itself: testing feof() first runs the loop one extra time
    after the last line. */
 i = 0;
 while(fgets(line[i], BUFSIZ, conf) != NULL) {
     i++;
 }

 for(j = 0; j < i; j++) {
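     /* Grab the id.  Splitting on the newline too keeps it from
        sticking to the last token on the line. */
     id = strtok(line[j], " \t\r\n");
     if(id == NULL) {
         printf("Parse error on line %d\n", j);
         exit(1);
     }

     /* Grab the value */
     val = strtok(NULL, " \t\r\n");
     if(val == NULL) {
         printf("Missing value on line %d\n", j);
         exit(1);
     }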

     /* Check the body part, and assign it in the struct */
     if(!strcmp(id, "MyButt")) {
         mybody.butt = val;
     } else if(!strcmp(id, "MyChest")) {
         mybody.chest = val;
     } else {
         printf("Unknown body part '%s' on line '%d'\n", id, j);
         exit(1);
     }
 }

 /* Since I appear to have some space, no need to waste it by 
    delaying the rant.

    Notice that I make no mention of memory management, aka free(),
    malloc(), realloc(), etc in this code.  You will have to figure
    that out on your own.  Trust me, it's going to be twice the
    nightmare when parsing the XML.

    Also, nominally for the XML, there's a function called 
    strip_quotes(), to remove the quote marks around the values in the 
    tag.  Nominally, it would return a NULL when value wasn't properly 
    quoted or escaped.

    Of course, you could always use one of them fancy XML parsing 
    libraries which don't know exactly what data you want so they have
    to store all of the tags to make sure that you get it all.  Then
    it is still your problem to figure out if the data you got was
    something that you wanted.

    When it really comes down to it, it takes no more effort or time
    to write a flat configuration file parser than it does to parse
    XML, even if you are using one of the libraries.  And you make it
    easier on the computer by not making it string parse as much as
    possible.  So you end up with a faster running program which is
    easier to debug.  And you don't have to worry about the XML 
    library API changing every week.

    Keep in mind that this is the pathologically simple case, so I 
    can write a very specific parser for the XML.  If the XML were
    any more complex, then I would have to abstract out how the data
    was returned, then check out the data in it.  All the while I 
    would have to worry about arrays, nay hashes, with all sorts of
    helper functions, megabytes of memory, inheritance, unlimited
    rice pudding, etc.

    But then again, what XML user would ever make the XML that
    simple?
 */


 /* Since I don't know if there is more than one tag per line,
    read the whole shebang */
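 /* n is a size_t.  Leave room for the terminator: strtok() needs a
    NUL-terminated string and fread() doesn't add one. */
 n = fread(myconf, sizeof(char), CONFBUF - 1, conf);
 myconf[n] = '\0';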

 /* Parse it out into individual tags, XXX - This breaks when a '<' is 
    inside quote marks.  Left as an exercise for the reader.  :^) */
 i = 0;
 mtag = strtok(myconf, "<");
 while(mtag != NULL) {
     tag[i] = mtag;
     mtag = strtok(NULL, "<");
     i++;
 }

 for(j = 0; j < i; j++) {
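     /* Fixup the tag by stepping on the "/>" at the end */
     mtag = tag[j];
     mptr = strrchr(mtag, '>');
     if(mptr == NULL || mptr == mtag || mptr[-1] != '/') {
         printf("Parse error: tag %d does not end in \"/>\"\n", j);
         exit(1);
     }
     mptr[-1] = '\0';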

     /* Now all I have is the contents of the tag */
     mpart = strtok(mtag, " \t\r\n");
     k = 0;
     while(mpart != NULL) {
         part[k] = mpart;
         mpart = strtok(NULL, " \t\r\n");
         k++;
     }

     /* Now the parts of the tag are in the array 'part' */
     /* Part 0 must be the tag type, or whatever */
     if(strcmp(part[0], "Bodypart")) {
         printf("Poorly named tag '%s' (%d)\n", part[0], j);
         exit(1);
     }

     /* Okay, let's see if we can't get some data */
     mid = NULL;
     mval = NULL;
     for(l = 1; l < k; l++) {
         lhs = strtok(part[l], "=");
         if(lhs == NULL) {
             printf("Parse error: tag %d, part %d\n", j, l);
             exit(1);
         }

         rhs = strtok(NULL, "");
         if(rhs == NULL) {
             printf("Parse error (rhs): tag %d, part %d\n", j, l);
             exit(1);
         }

         /* Make sure that this tag is intended */
         if(!strcmp(lhs, "id")) mid = strip_quotes(rhs);
         else if(!strcmp(lhs, "value")) mval = strip_quotes(rhs);
         else {
             printf("Unknown tag ident '%s'.  Tag %d, part %d\n",
                 lhs, j, l);
             exit(1);
         }
     }

     /* See if we have all the data we need */
     if((mid == NULL) || (mval == NULL)) {
         printf("Parse error: Insufficient data in tag %d\n", j);
         exit(1);
     }

     /* Make sure that 'id' is a valid body part */
     if(!strcmp(mid, "MyButt")) {
         mybody.butt = mval;
     } else if(!strcmp(mid, "MyChest")) {
         mybody.chest = mval;
     } else {
         printf("Unknown body part '%s' in tag %d\n", mid, j);
         exit(1);
     }
 }
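The XML parser above leans on strip_quotes(), which the article never spells out. Here is a minimal sketch of what it might look like, going only by the description in the comment: strip the quote marks around an attribute value and return NULL if the value isn't properly quoted. Working in place and accepting either single or double quotes are my assumptions, and it makes no attempt to decode entities like `&quot;`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strip_quotes(): strips the matching quotes around an
   attribute value in place and returns a pointer to the bare value,
   or NULL if the value is not properly quoted. */
char *strip_quotes(char *s)
{
    size_t len;
    char quote;

    if (s == NULL)
        return NULL;

    len = strlen(s);
    if (len < 2)
        return NULL;                /* too short to hold two quotes */

    quote = s[0];
    if (quote != '"' && quote != '\'')
        return NULL;                /* no opening quote */
    if (s[len - 1] != quote)
        return NULL;                /* no matching closing quote */

    s[len - 1] = '\0';              /* step on the closing quote */
    s++;                            /* skip the opening quote */

    /* A raw quote or '<' inside the value means it wasn't escaped. */
    if (strchr(s, quote) != NULL || strchr(s, '<') != NULL)
        return NULL;

    return s;
}
```

Since it returns a pointer into the caller's buffer, there is still no malloc() or free() in sight, which keeps it in the spirit of the rest of the code. The catch is that the input gets scribbled on even when the function returns NULL.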

I've put almost 15 minutes of thought into this, which apparently is more time than the people who use XML have put in. These are the same people who complain that computers just don't work right. If you abuse the computer, the computer is going to abuse you back.

I use XML for RSS

Well, that's more like what the description of XML from the W3C says. Which makes you wonder why that paragraph isn't reserved for RSS. Okay, let's see what they say about it:

RSS 1.0 ("RDF Site Summary") is an RDF Vocabulary¹ that provides a lightweight² multipurpose³ extensible⁴ metadata description⁵ and syndication format⁶. In short⁷, it's a means for describing news and events⁸ so that they can be shared across the web⁹.

1. 'Cause XML by itself wasn't good enough. Plus, the name says so.
2. XML -> RDF -> RSS... yeah, lightweight.
3. We'll figure out other purposes later
4. Just like XML
5. More stuff to parse
6. Pushing your crap that nobody wants to read
7. We don't really know how to describe it
8. The stuff in your live journal that nobody cares about
9. Except for that part where it has to be downloaded, parsed, and reformatted so that your technologically advanced browser, which doesn't understand what '-1px' means, doesn't make it look like shit.

Once again, they're taking the opportunity to fill your data stream with superfluous tags instead of something that the computer has an easy time understanding. Here's a sample of an RSS feed, and the way that I would do it because I actually like using my computer instead of waiting for it to parse XML.

 <item>
  <title>XML: The scourge of the Internet</title>
  <link>
   http://users.electromagnetic.net/bu/rants/xml.php
  </link>
  <description>
   Stop being a pussy.  Be a man and parse your own god-damned 
   text files. 
  </description>
  <pubDate>Mon, 23 Aug 2004 15:20:33 EDT</pubDate>
 </item>

%
XML: The scourge of the Internet
#
http://users.electromagnetic.net/bu/rants/xml.php
#
Stop being a pussy.  Be a man and parse your own god-damned 
text files.
#
1093288833
%

Yeah, so you can't put the data in whichever order you want. Waah! Suck on it and just put it down. Put a god damned universal date (in this case UNIX time_t) so that people can actually see what time it was published in their own time zone. Plus, you can actually mmap() this file so you don't have to crawl through it like a toddler looking for his mommy.
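Here is a rough sketch of a parser for the flat format above. It assumes a "%" line opens and closes each record, a "#" line separates the fields, and the fields always come in the order title, link, description, timestamp. It works in place on a writable, NUL-terminated buffer, which you would have to arrange yourself (say, with a private mmap() or a plain fread()). The names `struct item`, `parse_item()` and `format_published()` are made up for this example.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct item {
    char *title;
    char *link;
    char *description;
    time_t published;
};

/* Parse one record in place, starting at the "%" line that opens it.
   Returns a pointer to the "%" line that closes it (and opens the next
   record), or NULL on a parse error. */
char *parse_item(char *buf, struct item *out)
{
    char *field[4];
    char *p = buf;
    char *sep;
    char *end;
    int i;

    if (strncmp(p, "%\n", 2) != 0)
        return NULL;
    p += 2;

    /* The first three fields each end at a line holding only "#". */
    for (i = 0; i < 3; i++) {
        sep = strstr(p, "\n#\n");
        if (sep == NULL)
            return NULL;
        *sep = '\0';
        field[i] = p;
        p = sep + 3;
    }

    /* The last field ends at the "%" line that closes the record. */
    sep = strstr(p, "\n%");
    if (sep == NULL)
        return NULL;
    *sep = '\0';
    field[3] = p;

    out->title = field[0];
    out->link = field[1];
    out->description = field[2];
    out->published = (time_t)strtol(field[3], &end, 10);
    if (end == field[3] || *end != '\0')
        return NULL;            /* timestamp was not a plain number */

    return sep + 1;
}

/* Format the timestamp for display.  localtime() would give the
   reader's own zone; gmtime() is used here so the result does not
   depend on the TZ setting. */
size_t format_published(time_t t, char *out, size_t outsz)
{
    struct tm *tm = gmtime(&t);

    if (tm == NULL)
        return 0;
    return strftime(out, outsz, "%Y-%m-%d %H:%M:%S UTC", tm);
}
```

Fed the record above, format_published() turns 1093288833 into 2004-08-23 19:20:33 UTC, the same instant as the 15:20:33 EDT in the RSS version. There is no tag matching, no quoting, and no copying, and swapping gmtime() for localtime() gives each reader the time in their own zone. It is only a sketch: it does not cope with empty fields, and a malformed record can make it read into the next one.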

Yeah, so it's not extensible. Like RSS is extensible. Lemme see, let's extend it to tell people where they can find their butt. Then let's push out the extended RSS definition file so that their XML parser knows what to do with it. THEN WHAT? Their browser or their parser knows what to do with it? No! Good use of that extensibility there.

I shudder to think about the Google ad that is going to be at the top of the page when you're reading this. Needless to say, it's gonna be some company trying to sell you a tool kit or training courses or materials on XML. First off, as you can tell, I don't know why anybody would want XML, or why anybody would want to pay money to support the W3C's XML habit. But if you still do insist, go ahead. At least I'll make some money off of it.

Make what you want of this. If you love XML and use it religiously, I'm not here to change your mind: I'm just here to call you lazy and stupid. So, like the RSS feed says: Stop being a pussy. Be a man and parse your own god-damned text files.