XML

By: David K. Every
©Copyright 1999


HTML is a way to tag text, using beginning and ending tags around each block, so that formatting can be applied -- setting the font, style, size or alignment of the text, embedding an image in the page, or linking to another file (page). This lets a browser read the tagged text and style it in a more pleasant and readable way. It also lets websites link many files (pages) together, and offer you links to other relevant pages on the web, which binds the entire internet into one large web of pages (where each page or site is linked to other information and you can navigate through it all).

The rules are pretty easy as well. Each tag is surrounded with "<" and ">", and most tags come in pairs: a starting tag and an ending tag. So a sample of HTML source might look like:

  • <P>This is a test of <B>bold</B>, <I>italic</I>, and <FONT SIZE="+3">large</FONT> text.</P>

Where:

  • <P> and </P> surround plain text (a basic paragraph block)
  • <B> and </B> surround bold text
  • <I> and </I> surround italic blocks of text
  • <FONT SIZE="+3"> and </FONT> surround sized text

When you view the same text through a web browser, it interprets those commands, to give you something that might look like:

  • This is a test of bold, italic, and large text.
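How a program pulls the tags apart from the content can be sketched in a few lines of Python. This is only an illustration using the standard library's `html.parser` module, not how any real browser is built, and the class name `TagLister` is made up for the example.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects start tags and the plain text found between tags."""

    def __init__(self):
        super().__init__()
        self.tags = []  # tag names, in the order they open
        self.text = []  # pieces of content with the tags removed

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


source = ('<P>This is a test of <B>bold</B>, <I>italic</I>, '
          'and <FONT SIZE="+3">large</FONT> text.</P>')

lister = TagLister()
lister.feed(source)
lister.close()

print(lister.tags)           # the parser reports tag names in lowercase
print("".join(lister.text))  # the content a reader actually sees
```

A browser does the same kind of separation, and then uses each tag to decide how the text inside it should be drawn.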

There is a lot more to HTML in quantity (lots of different tags that you can use to format a page) -- but these are the basic concepts of how it works and what it is about.

Limitations

There are a few major problems with HTML.

  1. HTML embeds most of the format data (tags) right into the content -- which makes it hard to read or change things easily.
     
  2. HTML doesn't define what things will look like (absolutely) -- it leaves interpretations of the tags (what things should look like) up to the browsers.
     
  3. HTML defines only the formatting of the data (and does that poorly) -- it does not define what the content is or what the block of text is for.

These limitations may appear minor -- but a better solution would make the web (or tagged text) far more useful. Basically what it boils down to is that the form of the tagged data is unpredictable and ugly, and the function of the data (what it represents) is a complete unknown, which makes it harder to catalog and search for things. But other than having lousy form and no function, it is perfect. So each of those items needs to be addressed.

Fixing the form (Style Sheets)

In HTML, all the tags are just embedded formatting information. But embedding this information with the content marries the two, and makes it very hard to separate them. If you ever want to change things, you get to search through every page's content and every tag, and see if that is the right size / color / look. And all the sizing is relative, and changes from browser to browser. It can be very painful (and near impossible) to learn how to tweak the look of a site so it will look good in all browsers -- designers must compromise and accept "good enough".

If you care about the content of the text, all those tags for size and formatting are just annoying clutter. Also in a large document the tags to define how something looks can get quite long, constantly setting many attributes individually. This can increase the size of an article many times over the original -- which increases download time and just gets in the way.

What we really want is to define a type of text once (these are my headings, this is body text, this is a quote), define it in absolute size, position and look, and then use that custom tag everywhere. Then everything of a particular type will look the same. This is a style sheet: it says what each "style" of text looks like. That one style sheet can be downloaded once, and then used in all pages on your site -- thus making things faster. If you ever change the style in that one style sheet, then all your pages adapt. This is an improvement over normal HTML -- but style sheets are only a half step to XML.
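The "define it once, use it everywhere" idea can be sketched in Python. This is not CSS or any real style sheet format; the style names, the values, and the `render` helper are all invented for the illustration.

```python
from html import escape

# One shared "style sheet": each named style is defined exactly once.
# The style names and values are made up for this illustration.
STYLE_SHEET = {
    "heading": {"font-size": "24pt", "font-weight": "bold"},
    "body": {"font-size": "12pt"},
    "quote": {"font-size": "12pt", "font-style": "italic"},
}


def render(blocks, style_sheet):
    """Turn (style name, text) pairs into HTML with the look filled in."""
    lines = []
    for style_name, text in blocks:
        rules = style_sheet[style_name]
        css = "; ".join(f"{prop}: {value}" for prop, value in rules.items())
        lines.append(f'<p style="{css}">{escape(text)}</p>')
    return "\n".join(lines)


# The page itself only says what each block is, not what it looks like.
page = [
    ("heading", "XML"),
    ("body", "HTML is a way to tag text."),
]

print(render(page, STYLE_SHEET))

# Changing the style once changes every page rendered with the sheet.
STYLE_SHEET["heading"]["font-size"] = "30pt"
print(render(page, STYLE_SHEET))
```

The page data never mentions a font or a size, so a change to the look touches one place instead of every page.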

XML also uses style sheets. Instead of CSS (Cascading Style Sheets, as used with HTML), XML's own style sheets are called XSL (Extensible Stylesheet Language) -- but they serve the same function. The point of style sheets is that a lot of the formatting is pulled out of the page, all the formatting is standardized (and thus reduced, compressed and simplified), and the content of a page is left as mostly the text -- with some tags just defining what something is, and the style sheet defining what it looks like. That idea, taken a little further, is the key to understanding XML, and why it is so much more significant than HTML.

XML

HTML uses tags to define format information -- but it doesn't explain what the text is. There are a few things defined, like that something is a heading, or something is a quote, or that something is sample code -- but that is just generic information for formatting. These categories (tags) defined by HTML are way too broad. What is this a quote of? What is the heading about? What language is the sample code written in? What does it do? What is the block of text you are formatting really about? That is all the important stuff that the computer wants to know -- because if the computer can know it, then the computer can extract the right information you ask for, and users can easily search for it.

That is the really important issue. Computers don't understand text well -- English, and most human languages, are too ambiguous for computers. Computers look for words and not meanings -- and as anyone who has tried to use a search engine knows, it is hard to find just what you want. But computer programs (parsers) can easily read and process tags (a meta-language) -- because the rules are much simpler and more discrete than English. You are defining things for the computer, and telling it, "I'm talking about X". So with XML I can mention Dogs in an article about Cats, and because the tags tell the computer the article is about cats, you don't have to see this article every time you search for articles about Dogs. This helps the computer to filter out all that stuff you don't want to see, just because it has a word in it somewhere that matches your search. This is what XML is about -- defining what something is.
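The Dogs and Cats case can be shown with a small sketch using Python's standard `xml.etree.ElementTree` module. The `<library>`, `<article>`, `<topic>` and `<body>` tags are invented for this example (XML itself does not define them), and the two search functions are only a toy, not how a real search engine works.

```python
import xml.etree.ElementTree as ET

# Two tiny articles. The tag names are made up for this example.
LIBRARY = """
<library>
  <article>
    <topic>Cats</topic>
    <body>Cats, unlike Dogs, mostly ignore the mail carrier.</body>
  </article>
  <article>
    <topic>Dogs</topic>
    <body>Dogs need a walk every day.</body>
  </article>
</library>
"""

root = ET.fromstring(LIBRARY)
articles = root.findall("article")


def word_search(word):
    """Match any article whose text mentions the word."""
    return [a.findtext("topic") for a in articles if word in a.findtext("body")]


def topic_search(topic):
    """Match only articles tagged as being about the topic."""
    return [a.findtext("topic") for a in articles if a.findtext("topic") == topic]


print(word_search("Dogs"))   # both articles mention the word
print(topic_search("Dogs"))  # only the article that is about dogs
```

The word search drags in the article about cats because the word appears in it; the tag search does not, because the tag says what the article is about.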

XML threw out the predefined tags (by default) and kept just the syntax rules. You begin tags the same way as in HTML, and end them the same way, with a few more convenience rules (for example, a tag can be a beginning and an end at the same time, by just adding a '/' before the '>'). But that is about it. You create tags, you create a style sheet to define what those tags will look like, and you can agree with others on what tags you will use in common, to define what something is.
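Those syntax rules are simple enough that a parser can check them without knowing what any tag means. A minimal sketch, again with Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the `<recipe>`, `<step>` and `<pagebreak>` tags are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text obeys XML's tag syntax rules."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Made-up tags are fine as long as every start tag has a matching end tag.
print(is_well_formed("<recipe><step>Boil water</step></recipe>"))

# An empty tag can open and close itself with "/>".
print(is_well_formed("<recipe><pagebreak/></recipe>"))

# A missing end tag breaks the rules, so the parser rejects the document.
print(is_well_formed("<recipe><step>Boil water</recipe>"))
```

The parser accepts the first two documents and rejects the third, even though it has never heard of a recipe.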

If a computer learns what a tag is, then it can search for those tags, and behave like it knows what something is really about. Instead of people having to manually run around and add links to each other's pages, programs can automatically search for all related articles, far better than they can today. Much of the manual work of the web is removed and becomes automatic. Search engines work better, and people are more productive.

Conclusion

Remember, XML is not a language -- it is a meta-language. It is just the syntax of tags, and how they will be created. It doesn't define what something is -- programmers or users have to. XML defines what tags look like -- users define what each tag is. XML tells the computer (or programs) where that something starts and ends. But as we all agree on what different tags mean, all computer programs will be speaking the same language(s). It is going to take time to agree on different sets of tags for different datatypes -- but the process has started, and computers will be far more powerful because of it (in a few more years). As we get XML browsers, and agree on what XML tag sets mean, we can define what information is (not just how it looks), and the web will become a much more powerful tool.

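HTML is not strictly a subset of XML (it was originally defined using SGML, the older meta-language that XML was derived from). But you can define the HTML language using the XML meta-language, and that is what XHTML does.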

XML isn't just revolutionizing the web -- it is becoming a standard format for all programs to talk to each other. It is a metaphorical universal file format for everything (or the building blocks of one). Databases can share data with each other by using XML. Incompatible computer programs of every sort can export and import XML, and thus talk to each other and share information. Preference files and settings are being saved in XML, so that power users can read the files and alter things. All sorts of lists, descriptions, and chunks of data are learning how to be streamed in and out of XML -- which is becoming the de facto Object Oriented meta-language of computing. As we agree more and more on what groups of tags mean, the vocabulary of XML will continue to get richer, and what we can do with XML (and computers) will keep growing.
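As a small sketch of that export and import idea, here is a settings round trip in Python using the standard `xml.etree.ElementTree` module. The `<preferences>` and `<setting>` tags, the function names, and the sample settings are all assumptions made up for this example, not any real program's preference format.

```python
import xml.etree.ElementTree as ET


def export_prefs(prefs):
    """Write a dict of settings as an XML string."""
    root = ET.Element("preferences")
    for name, value in prefs.items():
        setting = ET.SubElement(root, "setting", name=name)
        setting.text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_prefs(xml_text):
    """Read the settings back; any program that knows the tags can do this."""
    root = ET.fromstring(xml_text)
    return {s.get("name"): s.text for s in root.findall("setting")}


saved = export_prefs({"font": "Geneva", "size": "12"})
print(saved)                # readable text a power user could edit by hand
print(import_prefs(saved))  # the same settings, read back in
```

The two functions could live in two entirely different programs; all they need to share is the agreement on what the tags mean.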


Created: 10/24/99
Updated: 11/09/02

