XML will be five years old on Saturday; the W3C published XML 1.0 as a Recommendation on February 8, 1998. Since its first introduction, the Extensible Markup Language has become pervasive nearly everywhere that people manage information. With its companion and follow-on specifications, XML Namespaces, the XML Information Set, XSL Transformations (XSLT), XML Schema, XML Linking, and so on, XML has changed not only the way people publish documents on the Web but also the way people manage information internal to their enterprise.
As two original members of the Working Group, we have witnessed many changes, some good, some a little disconcerting. In our continuing seven-year effort to define XML, now seems a good time to reflect for a moment on the hopes that accompanied the development of XML, on what has happened since, and on what should happen next.
In the Beginning... Before XML had a name, a team of twelve people came together for a simple reason and with modest expectations. We were all professionals with significant shared experience both with the World Wide Web and with using computers to process and manage information with SGML, the direct ancestor of XML.
The Web was becoming ubiquitous--we wanted to use it to publish our SGML-encoded information. The ten-year-old SGML made information reusable; its power was its ability to describe information in a way that was independent of the system it was intended to be used on. But SGML was difficult to learn and use, its acceptance was limited to documentation professionals, and it was very difficult to use SGML with the new medium known as the Web. The Working Group formed around the shared belief that the two technologies could be made to work together to make it easier to share and reuse information.
Working under the auspices of the World Wide Web Consortium (W3C), we began by agreeing upon ten goals, which are still listed in the first chapter of the XML specification. The nameless subset of SGML we were developing should be easy to use on the Internet, support a wide variety of applications, be compatible with SGML, and so on. The goal of bringing together the two powerful ideas of the Web and of descriptive markup energized our group and drove us to work evenings and meet by teleconference not only on Tuesdays but also on Saturday mornings. Whenever we lost our way, someone would ask, "Is this feature necessary for success?" The group worked to transform these goals and experiences into a formal language, a language designed to make sharing reusable information ubiquitous.
SGML on the Web. Just as interchangeable parts drove the Industrial Age, reusable information powers the Information Age. Our shared experience with SGML had taught us that information becomes more valuable when it can be shared and reused. And the Web would let us share information with wider audiences than ever imagined. We knew that SGML was the best approach for reusing the kinds of information we worked with. But we needed to make SGML easier to learn, understand, and implement, while retaining its core values; in short, SGML fit for the Web.
The core value of SGML that we wanted to build into XML is that of descriptive markup. Markup is information inserted into a document for computers to use; in SGML, it takes the form of tags that mark the document's structure. Descriptive markup labels the structure and other properties of information in a way that is independent both of the system the information is created on and of the processing to be performed on it.
We did not want XML to be a fixed set of tags: we wanted XML, like SGML, to be a meta-language, a language used to create other languages. A meta-language lets users define languages that are relevant to their own information. User-defined, processing-independent markup is easier to reuse and can be processed in new and often unexpected ways.
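As a small sketch of what this means in practice, consider a hypothetical user-defined vocabulary for recipes (the element and attribute names here are our invention, not part of any standard). A completely generic XML parser, such as the one in Python's standard library, can process it without knowing anything about recipes:

```python
# A user-defined XML vocabulary (hypothetical "recipe" markup) processed
# by a generic parser that knows nothing about the vocabulary itself.
import xml.etree.ElementTree as ET

doc = """
<recipe serves="4">
  <title>Tomato Soup</title>
  <ingredient amount="6">tomatoes</ingredient>
  <ingredient amount="1">onion</ingredient>
</recipe>
"""

root = ET.fromstring(doc)
print(root.find("title").text)        # Tomato Soup
for ing in root.findall("ingredient"):
    print(ing.get("amount"), ing.text)
```

The same parser, unchanged, would handle a purchase order or a poem equally well; only the tag names differ. That is what makes a meta-language reusable.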
Descriptive markup also makes information independent of any particular piece of software. System-dependent and proprietary formats hinder the reuse of information and make the data owner dependent on the vendors whose software can create and manipulate those formats. Like SGML, XML was intended to help information owners escape being locked in to a particular vendor.
With an SGML fit for the Web, computers (and humans) could use descriptive, structural markup in their documents easily and reliably. By tagging data descriptively, information owners can turn documents into semantically rich data and avoid the kind of presentation-oriented markup used just because it looks right, markup we called "crufty tag salad".
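The difference is easy to see in code. In this hypothetical contrast (both fragments are invented for illustration), descriptive markup makes the meaning of a value directly addressable, while presentation-oriented markup leaves a program guessing that bold italic text happens to be a price:

```python
# Contrast: presentation-oriented "tag salad" vs. descriptive markup.
import xml.etree.ElementTree as ET

presentational = "<p>The widget costs <b><i>$9.99</i></b> today.</p>"
descriptive = ('<offer><product>widget</product>'
               '<price currency="USD">9.99</price></offer>')

# Descriptive markup: the price is labeled as a price, with its currency.
offer = ET.fromstring(descriptive)
print(offer.find("price").get("currency"), offer.find("price").text)  # USD 9.99

# Presentational markup: a program can only locate "bold italic text"
# and hope it is a price; nothing in the markup says so.
para = ET.fromstring(presentational)
print(para.find("b/i").text)  # $9.99
```

A program processing the descriptive version keeps working even if the rendering changes; a program scraping the presentational version breaks the moment someone decides prices should no longer be bold.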
To our surprise, we did it. The 25-page XML specification could be easily learned and implemented. XML is a meta-language that lets you design markup languages that describe what is important to you. Its elements and attributes capture logical structure and enable semantic understanding. Working "under the radar", we were able to balance features against complexity. The litmus test "Is it necessary for success?" helped us create a language fit for the Web.
Before we knew it, all sorts of people started using XML--best of all, doing so without the permission or guidance of the Working Group. In effect, they crashed our party. Database people, transaction designers, system engineers, B2B developers all crashed the party. Why, an outsider even got an article on XML published in Time magazine!
People flocked to talks about XML, and tools were created, not just by a few but by the largest software companies in the world. The press reported on it, at first with much misunderstanding but later with growing insight into how XML could make its mark on the Information Age.
XML became the standard platform for convergence of information. Rapid growth in the XML community was good luck, because it meant there were a lot more tools than there ever would have been otherwise. But it also had an even more important consequence: information that had been stored in document systems, word processors, and databases was suddenly accessible in the same format and could be processed with the same tools. XML became pervasive nearly everywhere that text-based information is managed by computers. Remarkable!
The Forces of Change. Of course, success brought pressures to fix things the Working Group got wrong. Experience working with XML has shown that parts of the design don't work quite as well as other parts. The mechanisms for declaring and using entities, the rules for processing well-formed documents, and the limited possibilities of nesting full XML documents inside other XML documents are all occasionally sources of difficulty. Because XML was so stripped down, it was easy to adopt and extend; because it was so stripped down, adopters almost had to extend it.
And we did. And those other people did, too. The original small Working Group with its common, shared experience gave way to lots of groups with differing goals and backgrounds. XML grew stronger for the new insights. You now have XML + XLink + XSL + Namespaces + Infoset + XML Linking + XPointer Framework + XPointer namespaces + XPointer xptr() + XSLT + XPath + XSL-FO + DOM + SAX + stylesheet linking PI + XML Schema + XQuery + XML Encryption + XML Canonicalization + XML Signature + DOM Level 2 + DOM Level 3.
But it also grew. It grew more complex. It grew confusing. What started as a trim 25-page spec, SGML slimmed down for the Web, now has become a complex set of specs totaling hundreds of pages. These specs describe powerful technologies. But taken as a whole, who would describe them now as slim or trimmed down? Five years ago, XML tools could be developed by a good programmer in a week; now it may take full-time teams of the best programmers to keep up. Usability has suffered a bit.
Now that XML is five, where do we go from here? How do we keep XML useful? Should we add even more functionality, describing it in hundreds of pages of ever-growing complexity? If we do, we risk finding that ever fewer people can understand all the intricacies of XML.
Should we rethink how the XML specifications are layered? Should we fix or change how details such as entities are declared or perhaps eliminate the need for the older DTD syntax? Should we change the rules about where structure can be declared so as to make it easier to nest information? Shall we eliminate or redesign attributes? Or perhaps we need to split everything into two formats, one designed for machine-level processing and the other for humans? All of these could improve XML, but would they address the fundamental complexity that we have created?
Our entire computing architecture is in flux. Not only are the computers themselves changing, but new network computing approaches, such as grids and peer-to-peer, are continuing the evolution of the Web. Ever more distributed processes are raising new questions and providing new opportunities to help manage the information glut.
With these new architectures comes an increased need to interact with data. Large numbers of people must have intimate knowledge of information and of how to build systems to manage it. If these people cannot easily understand XML and its companion specs, they will find something else which is slimmer and trimmer.
Can we, the community defining XML and building the tools to use it, go back and use what we have learned over the last five years to guide a new round of asking the question posed so often seven years ago: "Is this necessary for success?" Can we use 20-20 hindsight to reach the 80/20 balance we worked so hard to achieve? Can we find the passion and enthusiasm needed to make and keep XML as fit and trim as it was five years ago?
Dave Hollander, CTO of Contivo, and Michael Sperberg-McQueen, Architecture Domain Lead of the World Wide Web Consortium, were members of the W3C Working Group which developed the XML 1.0 specification. They serve as co-chairs of the W3C XML Coordination Group and of the W3C XML Schema Working Group.