Pipeline for Semantic Integration

“The asteroid to kill this dinosaur is still in orbit.” [1]

Semantic integration means successfully harmonizing the semantics of two or more systems. The goal may sound simple, but it is very hard to achieve in our world of legacy systems, emerging technologies, and highly nuanced subject domains. To design robust, flexible and agile systems capable of semantic harmonization, this paper looks back to the 1970s and examines a breakthrough in compiler development. The same design pattern is then applied to processes for semantic reconciliation using today’s technologies.

This paper will provide an overview of the steps needed to achieve the goal and present considerations for implementation. The paper does not attempt to explain the principles of lexical, syntactical or ontological analysis; rather, it outlines how these processes can be kept separate and applied in sequence to achieve semantic interoperability in today’s IT infrastructure consisting of legacy applications, EAI, B2B and Web Service/ESB/SOA technologies. The paper is intended for architects and designers; footnotes for text processing and semantic terms such as lexical, parsing and ontology are provided for other readers.

In the 1970s, two programs emerged from Bell Labs that radically changed compiler technology: Lex and Yacc. Compilers process a source code file and generate a different type of file. Early compilers were large, complicated programs whose cost of development limited their application. When S. Johnson, M. Lesk and E. Schmidt created Lex and Yacc, the cost of developing compilers went down and compiler technology found its way into new arenas.

In the simple example on the right, a Pascal expression is converted to FORTRAN using Lex and Yacc. The process:

A developer creates instructions that describe character patterns and terminal characters for Lex. Yacc then uses developer-provided production rules to create the desired output.
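The two-stage idea can be sketched in a few lines of Python. This is not Lex or Yacc syntax; the token patterns and the single rewrite rule (Pascal’s “:=” becomes FORTRAN’s “=”) are illustrative assumptions standing in for generated lexer and parser code.

```python
import re

# Stage 1 (Lex's role): classify raw characters into tokens using
# regular expressions. Pattern order matters: ":=" must match before ":".
TOKEN_PATTERNS = [
    ("ASSIGN", r":="),           # Pascal assignment operator
    ("IDENT",  r"[A-Za-z]\w*"),  # identifiers
    ("NUMBER", r"\d+"),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),          # whitespace is discarded, not emitted
]

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        for kind, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, source[pos:])
            if m:
                if kind != "SKIP":
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"unexpected character at position {pos}")
    return tokens

# Stage 2 (Yacc's role): apply a production rule to emit target text.
# The only rule here rewrites Pascal assignment as FORTRAN assignment.
def translate(tokens):
    return " ".join("=" if kind == "ASSIGN" else text
                    for kind, text in tokens)

print(translate(tokenize("total := base + rate * 60")))
# prints "total = base + rate * 60"
```

The point of the separation is the same as in the original tools: the token patterns can be reused unchanged while the translation rules evolve, and vice versa.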

The benefits of using standardized and separate tools were increased code reuse, improved reliability, and decoupling of the business logic from grammatical processing. Reuse led to Lex and Yacc being used for new classes of programs such as lint, the Portable C Compiler, and a system for typesetting mathematics (roff) [2]. Revolutionary in its time, roff is a direct predecessor of the word processors we use today.

Semantic harmonization will also require breakthroughs. Achieving the same business result regardless of peculiarities in the source data will require us to effectively understand the underlying structure of the data, the metadata that describes the data, and relationships between the various data and metadata.

There is a growing understanding of semantics and their potential in successfully developing integrated business processes. Semantic technologies such as The Semantic Web, inference engines and description logics are joined by process and rules engines in attempting to clarify the intent behind the data exchanged between systems.

While semantic technologies show promise, they have seldom delivered the kind of breakthrough results desired. Limited by the extensive development efforts needed to implement them, they fail to achieve the level of reuse, reliability and decoupling needed by today’s complex computing environments.

Semantic Pipeline

The goal of the semantic pipeline is to significantly increase levels of reuse, reliability and decoupling. The pipeline describes three processes that operate on data in real time, but the pipeline most directly benefits the designer and developer. The logical pipeline and the design model it enables provide the modularity, separation of concerns, and reuse needed to enable breakthroughs in semantic processing.

Why breakthroughs? Because the interplay of data and behavior—semantics—is being addressed from at least four different directions, each with its own advantages, disadvantages and development processes. Data cleansing, inference-, process-, and rules-engines all require reconciling lexical and syntactic differences. Yet most of these technologies either hide this requirement or force developers to implement technology-specific solutions.

The semantic pipeline directly addresses reuse, reliability and decoupling by creating a set of separate processes that can be used on data from any source to create a simplified and coherent data resource for the semantic technologies to work with. A uniform data source allows semantically aware processes to be more focused and decoupled from technology changes.

The first two processes in the pipeline are like those in Lex- and Yacc-based compilers: lexical reconciliation resolves formatting differences, and schema reconciliation completes the resolution of grammatical differences. With all the detailed grammar peculiarities resolved, the various semantic processes can concentrate on resolving differences in intent.

This pipeline must be more complex than the Lex/Yacc-based compiler because semantic integration is more complex. It must overcome more variability in data and processing than a compiler. A compiler can rely on source data to conform to the language’s grammar rules, while semantic integration must accept data from a variety of sources, each with its own grammar. And compilers can rely on a restricted set of semantics, typically expressed as functions and operators such as “IF-THEN-ELSE” and “EQUALS”. Our processing systems have a bewildering variety of unique processing capabilities that must be accounted for.

With this complexity, why separate grammar processing into separate steps? The answer is the same as it was for compilers using Lex and Yacc: to improve reuse, coding efficiency, and decoupling. The pipeline is designed to enable semantic solutions to scale to the complexity of today’s IT environments. Also, fortunately, we can reduce the complexity of the pipeline by leveraging modern technologies, particularly XML.


Lexical Reconciliation

“Lexical analysis is the process of taking an input string of characters (such as the source code of a computer program) and producing a sequence of symbols called ‘lexical tokens’, or just ‘tokens’.” [3]

Lexical reconciliation unifies the way that tokens [4] are identified in input data streams. Each unique data source will have different delimiters such as the space character, tabs, newlines and line-feeds. The formats will also have different expectations as to how multiple delimiters are handled. Lexical analysis uses regular expressions to classify tokens into terminals and typed data.
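A small Python sketch illustrates the delimiter problem. The three source formats and their delimiter rules are hypothetical; the point is that the same logical record arrives delimited three different ways, and per-source rules are needed before the tokens can be treated uniformly.

```python
import re

# Hypothetical per-source delimiter conventions: tabs, runs of spaces, commas.
DELIMITERS = {
    "tab_file":   r"\t",  # every tab separates a field
    "space_file": r" +",  # a run of spaces counts as one delimiter
    "csv_file":   r",",
}

def tokenize_record(line, source_format):
    """Split one record into tokens using its source's delimiter rule."""
    return re.split(DELIMITERS[source_format], line.strip())

# The same logical record, delimited differently by each source,
# reconciles to the same token sequence:
print(tokenize_record("A123\t19.95\tEA", "tab_file"))    # ['A123', '19.95', 'EA']
print(tokenize_record("A123   19.95  EA", "space_file")) # ['A123', '19.95', 'EA']
print(tokenize_record("A123,19.95,EA", "csv_file"))      # ['A123', '19.95', 'EA']
```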

Lexical reconciliation, in this pipeline, resolves the lexical differences in various data resources. For example, data from legacy COBOL applications, flat-files created by extract programs and EDI are all reformatted to XML.
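A minimal sketch of that reformatting step, assuming a hypothetical fixed-width layout for a legacy order record (the field names and positions are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical fixed-width record layout: (element name, start, end).
LAYOUT = [("OrderID", 0, 6), ("PartNumber", 6, 14), ("Quantity", 14, 18)]

def record_to_xml(record):
    """Reformat one fixed-width legacy record into well-formed XML."""
    order = ET.Element("Order")
    for name, start, end in LAYOUT:
        ET.SubElement(order, name).text = record[start:end].strip()
    return ET.tostring(order, encoding="unicode")

print(record_to_xml("100042WID-4430  12"))
# <Order><OrderID>100042</OrderID><PartNumber>WID-4430</PartNumber><Quantity>12</Quantity></Order>
```

Once every source has passed through a converter like this, downstream steps see only one lexical model: well-formed XML.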

Why XML? XML begins with the assertion that all XML is “well formed”—that is, all XML data must conform to a single lexical model. All XML processors can lexically parse any well-formed XML document without a priori knowledge. And XML was designed to be able to describe the structures and nuances found in our systems and text.

The pipeline goals of reuse, reliability and decoupling are well met by using XML as the target for lexical reconciliation. Because all XML data must be well-formed [5], developers are assured that any conformant XML parser can process data from any reconciled source, without writing source-specific lexical code.

The Contivo Legacy Adapter and Flat-File Descriptions (FFD) provide the technology needed to perform lexical reconciliation for most of the structured data resources found in our enterprises. FFDs precisely define how to lexically analyze flat files, CSV, COBOL records, EDI and many other legacy formats and convert them into XML. Lexical processing using FFDs can be performed in Legacy Adapter software or in the DataPower XML accelerator hardware which will free your legacy processors from having to perform this compute-intensive step.

Schema Reconciliation

With a common lexical model, we can now turn our attention to reconciliation of differences in markup vocabularies, structure and datatypes. Markup is the data added to data to separate and label tokens. With XML, markup labels form a full vocabulary and can be different for each XML dialect. XML markup can also be nested and grouped in user-defined structures. And each element in the vocabulary can be assigned a user-defined or intrinsic datatype. In the example on the right, <ShoppingCart>, <ProductList>, <Part>, etc. are XML markup added to a simple purchase order.

In the pipeline, schema reconciliation resolves markup vocabulary term, structure and datatype differences in the various data resources. For example, there may be three different XML vocabularies used for purchase orders, such as those from a new web application, a customer using the RosettaNet B2B specification, and another customer using EDI. The reconciliation step will transform these structures to a single vocabulary and structure. Typically the target will be an internal standard or canonical model based on the data requirements of internal systems and processes.
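A simplified sketch of that reconciliation in Python. The two source vocabularies, the canonical terms, and the term map are all invented for illustration; a real mapping would also handle structural and datatype differences, not just element renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from two source vocabularies to one canonical vocabulary.
TERM_MAP = {
    "webapp":  {"cart": "PurchaseOrder", "item": "LineItem", "sku": "PartNumber"},
    "partner": {"PO": "PurchaseOrder", "Line": "LineItem", "Part": "PartNumber"},
}

def to_canonical(xml_text, source):
    """Rename every element to its canonical vocabulary term."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TERM_MAP[source].get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

a = to_canonical("<cart><item><sku>WID-4430</sku></item></cart>", "webapp")
b = to_canonical("<PO><Line><Part>WID-4430</Part></Line></PO>", "partner")
print(a == b)  # True: both sources normalize to the same canonical document
```

Downstream semantic processes now see one vocabulary regardless of which source the order came from.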

The pipeline achieves reuse by aligning the pipeline with the W3C XML Schema specification. Schemas provide the ability to specify a markup vocabulary, structure and datatypes for XML data. With schemas it is possible to precisely describe a set of documents that conform to a specific structural model.

While schemas provide a clear description of a syntactic model, data from different sources will typically conform to different schemas. A single schema can’t meaningfully describe a heterogeneous set of data sources. Schema reconciliation involves understanding the markup vocabulary, structural and datatype constraints of all the sources and transforming them to conform to a uniform model or set of models.

Before schema reconciliation processes can be designed, we must first collect and understand the schemas. Most legacy data is not described with a W3C XML Schema; instead it is described by schemas with different grammars, syntaxes and semantics. Typical schema metadata found in our legacy systems includes paper documents, COBOL Copybooks, SAP IDocs and BAPIs, and EDI specifications such as gXML. These schemas need to be lexically, syntactically and semantically reconciled before they can be used [6].

Transforming legacy schemas into a normalized form can be achieved with a variety of tools. If the legacy schema is expressed informally, as in a spreadsheet, the formal schema will have to be created by hand with a schema development tool. Formally expressed schemas such as COBOL Copybooks, BAPIs and IDocs can be automatically converted using tools such as the Contivo Vocabulary Management Solution (VMS).
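To make the conversion concrete, here is a toy stand-in for such a tool: a few lines of Python that extract field definitions from simple COBOL PIC clauses and assign XML Schema datatypes. The copybook fragment is invented, and real copybooks (REDEFINES, OCCURS, COMP fields, and so on) are far richer than this regex can handle.

```python
import re

# A fragment of a hypothetical COBOL copybook describing an order record.
COPYBOOK = """
05 ORDER-ID      PIC 9(6).
05 PART-NUMBER   PIC X(8).
05 QUANTITY      PIC 9(4).
"""

def copybook_to_fields(text):
    """Extract (name, XSD type, length) from simple PIC clauses.
    A toy stand-in for full copybook conversion tooling."""
    fields = []
    for m in re.finditer(r"\d+\s+([\w-]+)\s+PIC\s+([9X])\((\d+)\)\.", text):
        name, pic, length = m.groups()
        xsd_type = "xs:integer" if pic == "9" else "xs:string"  # 9 = numeric, X = character
        fields.append((name, xsd_type, int(length)))
    return fields

for field in copybook_to_fields(COPYBOOK):
    print(field)
# ('ORDER-ID', 'xs:integer', 6)
# ('PART-NUMBER', 'xs:string', 8)
# ('QUANTITY', 'xs:integer', 4)
```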

Once the schemas are available and normalized into XML Schema, developers can identify the relationships between various data elements. The relationships define a mapping which can be compiled into instructions a processor can use to transform source data.

Semantics vocabularies can be leveraged in the mapping processes. The Contivo Vocabulary Management Solution (VMS) provides a semantic dictionary that can be customized by the users. VMS provides automated mapping processes that leverage the dictionary in the development and maintenance of the mappings. The mappings can then be compiled into transforms that convert the various source data formats into the targeted data format. Transforms can be executed in a wide variety of applications, middleware and engines or DataPower’s XML Appliances.

The final step in schema normalization should be validation using XML Schemas. This step can verify that the data conforms to all of the markup vocabulary, structure, and datatype constraints. While schema validation can’t verify that the data is meaningfully correct, it can assure the data is structurally sound before beginning semantic-level processes. Validation using DataPower’s XML Appliances can eliminate the processor performance bottleneck that often prevents developers from deploying validation that is so vital to data quality efforts.
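The checks that validation performs can be sketched by hand. The following Python is only a stand-in for real W3C XML Schema validation (a deployment would use an XSD validator or an appliance, not hand-written checks); the required elements and the datatype rule are assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in structural checks: required elements plus one datatype constraint.
REQUIRED = ["OrderID", "PartNumber", "Quantity"]

def structurally_valid(xml_text):
    """Return True if the order has all required elements and an
    integer Quantity; a hand-rolled proxy for XSD validation."""
    order = ET.fromstring(xml_text)
    for name in REQUIRED:
        child = order.find(name)
        if child is None or not (child.text or "").strip():
            return False
    return order.findtext("Quantity").strip().isdigit()

print(structurally_valid("<Order><OrderID>100042</OrderID>"
                         "<PartNumber>WID-4430</PartNumber>"
                         "<Quantity>12</Quantity></Order>"))  # True
print(structurally_valid("<Order><OrderID>100042</OrderID></Order>"))  # False
```

Exactly as the paper notes, a valid result here says nothing about whether the order is meaningfully correct; it only guarantees the semantic layer receives structurally sound input.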

Semantic Reconciliation

With a lexically and schematically consistent data source, developers can address the final step—semantic reconciliation. In this stage, the intent of the actual data is examined and compared against the capabilities of the various processing systems. Context derived from the markup is often used in these processes as are business rules, data quality specifications, process models and ontologies. The goal of semantic reconciliation is to have a satisfactory business outcome even though the data may have widely varying semantics.

Semantic reconciliation processes can account for a wide variety of differences in data. For example, differences in key terms can be resolved by classifying “Check” and “Cheque” as “Payment Instruments”. Other semantic processes may apply security and business rules, for example to validate that a supplier is authorized to access inventory records for a certain set of products.
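The term-classification case reduces to a dictionary lookup against a shared concept. The dictionary below is a hypothetical miniature of the kind of semantic vocabulary such a process would consult:

```python
# Hypothetical semantic dictionary mapping source terms to a shared concept.
PAYMENT_INSTRUMENTS = {
    "check": "PaymentInstrument",
    "cheque": "PaymentInstrument",
    "wire": "PaymentInstrument",
}

def classify(term):
    """Resolve a source term to its concept, case-insensitively."""
    return PAYMENT_INSTRUMENTS.get(term.lower(), "Unknown")

print(classify("Check"))   # PaymentInstrument
print(classify("Cheque"))  # PaymentInstrument
print(classify("widget"))  # Unknown
```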

Still other semantic processes may focus on behavior and invoke different processes based on the context and values of specific data fields. For example, production orders that consume less than 1 hour of production capacity may only need direct supervision approval while orders that will require more than a week’s capacity may need to be routed to sales and planning for approval.
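Such a routing rule can be expressed directly. The thresholds below follow the paper’s example (under one hour, over a week of capacity); the middle approval tier and the role names are assumptions added to make the sketch complete:

```python
# Hypothetical approval routing based on consumed production capacity.
HOURS_PER_WEEK = 40  # assumed definition of "a week's capacity"

def approval_route(capacity_hours):
    """Route an order to approvers based on the capacity it consumes."""
    if capacity_hours < 1:
        return ["supervisor"]                    # small orders: direct supervision only
    if capacity_hours > HOURS_PER_WEEK:
        return ["sales", "planning"]             # large orders: sales and planning
    return ["supervisor", "production_manager"]  # assumed middle tier

print(approval_route(0.5))  # ['supervisor']
print(approval_route(120))  # ['sales', 'planning']
```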

Technologies used in semantic reconciliation range from well-established products to the latest generation of inference engines or “semantic technologies.” These technologies depend on creating accurate models of the semantics in their systems and in the source data, and therefore successful deployment is highly dependent on the quality of the data they receive. With lexical and syntactic reconciliation performed as separate steps, the semantic engines need not cope with meaningless grammar and parsing differences and can instead concentrate on resolving differences in intent and capability.


Separation of lexical, syntactical and semantic processes into a semantic pipeline can significantly increase the level of reuse, reliability and decoupling in the overall implementation. The pipeline can reduce the time and costs necessary to successfully implement complex semantic solutions. Risks attributable to differing interpretations of data are significantly reduced and maintenance becomes more manageable.

Perhaps most importantly, the pipeline provides an environment where the interplay of data and behavior is more clearly visible and manageable. New systems and data sources can be added to the solution, often without changing the semantic layer. Semantic engines can be modified and tested without interruption of existing systems and without having to reprogram all of the details implemented at the lexical and syntactical levels. Such an environment is capable of supporting breakthrough deployments of semantic technologies and delivering on their potential and promise.

[1] Early UNIX manual pages were noted for including such “humorous” passages. This one was also included in the paper: Yacc: Yet Another Compiler-Compiler; Stephen C. Johnson; AT&T Bell Laboratories; Murray Hill, New Jersey 07974; http://dinosaur.compilertools.net/yacc/

[2] ibid

[3] http://www.fact-index.com/l/le/lexical_analysis_1.html

[4] A token is an atomic symbol of the source program. http://www.comsci.us/compiler/notes/ch04g.html

[5] During the design of XML, the working group deliberated long and hard before adopting the “draconian” insistence that all XML conform to a straightforward set of lexical rules. In my opinion, our decision to require conformance has been a significant factor in the overall success of XML.

[6] Semantic harmonization is recursive. Data is harmonized with the use of metadata and metadata is harmonized with the use of meta-metadata. The meta-metadata is sometimes formally expressed as in OMG’s MOF or informally as in the schema component architecture described in the W3C XML Schema specification.