Project Proposal: An Ontology-based Metadata Management System for Heterogeneous Distributed Databases


CS590L – Winter 2002
Group members:

Quddus Chong, Judy Mullins, Rajesh Rajasekharan

Table of Contents

  1. Introduction

  2. Project Goals and Objectives

    1. Overall Goal

    2. Specific Objectives

    3. Significance of work

  3. Project Background

    1. Related Work

      1. Data mediators and Global Schema Integration

      2. XML-based Integration

    2. Our approach: Ontology-based Integration

  4. General Plan of Work

    1. Domain Analysis

      1. Business

      2. Digital Library

      3. Dublin Core

      4. Scientific and Medical

      5. Education

    2. Design of Activities to be undertaken

    3. Technologies and Tools

      1. Java Development Environment

      2. Enterprise JavaBean Component Model

    4. Methods and Techniques

      1. Metadata Extraction and Object-to-XML binding

      2. XML and Ontologies

      3. EJB and Security

    5. Time Table for Project Completion

  5. Bibliography


1. Introduction

The emerging global economy can be seen as one force that is motivating research into the interoperability of heterogeneous information systems. In general, a set of heterogeneous databases might need to be interconnected because their respective applications need to interact on some semantic level. The main challenge in integrating data from heterogeneous sources is in resolving schema and data conflicts. Previous approaches to this problem include using a federated database architecture, or providing a multi-database interface. These approaches are geared more towards providing query access to the data sources than towards supporting analysis.

The types of data integration can be broadly categorized as follows:

  • Physical integration – convert records from heterogeneous data sources into a common format (e.g. ‘.xml’).

  • Logical integration – relate all data to a common process model (e.g. a medical service like ‘diagnose patient’ or ‘analyze outcomes’).

  • Semantic integration – allow cross-referencing, and possibly inferencing, of data with regard to a common metadata standard or ontology (e.g. HL7 RIM, DAML+OIL).

Metadata is a detailed description of instance data: the format and characteristics of the populated instance data, with the particular instances and values depending on the requirements and role of the metadata recipient. Metadata is used in locating information, interpreting information, and integrating or transforming data. Maintaining a well-organized and up-to-date collection of an organization’s metadata is a significant step towards improving overall data quality and usage. However, this task is complicated by the varying quality and formats of the metadata available (or not) from the heterogeneous data sources, and by inconsistency in updating existing metadata. A more complete classification of types of metadata by application scenario and information content is given in [47].

An ontology is an explicit specification of the conceptualization of a domain. Information models (such as the HL7 RIM [9]) and standardized vocabularies (such as UMLS [8]) can be part of an ontology. Ontologies allow the development of knowledge-based applications. Benefits of using ontologies include:

  • Facilitate sharing between systems and reuse of knowledge

  • Aid new knowledge acquisition

  • Improve the verification and validation of knowledge-based systems.

This paper proposes a lightweight approach for the semantic integration of heterogeneous data sources with external domain-specific models, using ontologies. The technologies that will be used to implement this project include: Enterprise JavaBeans (EJB), XML Schema, XML Data Binding, JDBC Metadata, the Java Reflection API, XML Metadata Interchange (XMI), Resource Description Framework (RDF) Schema, LDAP, and Extensible Stylesheet Language Transformations (XSLT).

2. Project Goals and Objectives

Overall goal

The overall goal of this project is to develop a knowledge engineering tool to allow the knowledge engineer to specify semantic mappings from a local data source to an external data standard. A common concept model (ontology) is used as the basis for inter-schema data mediation. We are interested in the notion of providing this tool as an online web-service.

Specific Objectives

There are two aspects to this work, namely the Knowledge Engineering (KE) requirements and the System Development requirements. From the KE perspective, the system needs to support:

  • Knowledge modeling – should be able to associate terms from a local schema with concepts in the abstract ontology, and be able to specify the relationships between attributes in the data models of participating data sources.

  • Knowledge sharing – the system allows the exchange of information between data sources by providing a mapping and translation mechanism.

  • Knowledge reuse – external data standards become a common reference point for different local data sources. The system makes the standards accessible and reusable to Knowledge Engineers who are integrating their separate data sources, or migrating their local schema to the standard.
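
The knowledge modeling and sharing requirements above can be illustrated with a minimal sketch, assuming a mapping is simply a table from local attribute names to concept names in the shared ontology. The class name `ConceptMapping`, the attribute names (`pat_name`, `dob`, `patientFullName`) and the concept names (`Patient.name`, `Patient.birthDate`) are hypothetical and are not part of the proposed system.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: each data source maps its local attribute names to
// concepts in a shared ontology; translation goes through the shared concept.
class ConceptMapping {
    private final Map<String, String> attributeToConcept = new LinkedHashMap<>();
    private final Map<String, String> conceptToAttribute = new LinkedHashMap<>();

    // Associate a local schema attribute with an ontology concept.
    public ConceptMapping map(String localAttribute, String concept) {
        attributeToConcept.put(localAttribute, concept);
        conceptToAttribute.put(concept, localAttribute);
        return this;
    }

    public Optional<String> conceptFor(String localAttribute) {
        return Optional.ofNullable(attributeToConcept.get(localAttribute));
    }

    public Optional<String> attributeFor(String concept) {
        return Optional.ofNullable(conceptToAttribute.get(concept));
    }

    // Translate an attribute of this source into the matching attribute of
    // another source, using the ontology concept as the common reference.
    public Optional<String> translate(String localAttribute, ConceptMapping target) {
        return conceptFor(localAttribute).flatMap(target::attributeFor);
    }

    public static void main(String[] args) {
        ConceptMapping hospitalDb = new ConceptMapping()
                .map("pat_name", "Patient.name")
                .map("dob", "Patient.birthDate");
        ConceptMapping externalStandard = new ConceptMapping()
                .map("patientFullName", "Patient.name");

        // The first attribute has a counterpart in the standard; the second does not.
        System.out.println(hospitalDb.translate("pat_name", externalStandard));
        System.out.println(hospitalDb.translate("dob", externalStandard));
    }
}
```

Routing every translation through the ontology concept means each source needs only one mapping to the ontology, rather than one mapping per pair of sources.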

The system architecture should be designed and developed to ensure that the system meets these functional requirements:

  • Distributed – the service is provided to data sources distributed over a network.

  • Interoperable – the storage systems for the data sources are assumed to be heterogeneous, namely existing on different platforms, or using database management systems from different vendors. The service should interoperate with the native storage systems easily, without having to be modified extensively for each type of data source platform.

  • User friendly – the service should allow the Knowledge Engineer to be able to perform the selection of data sources and foreign schemas, specify the mappings between them, and other operations such as saving the file, through the use of a visual interface.


Significance of work

Electronic data exchange is the key goal driving the development of today’s networks. The Internet has made it possible to share electronic resources across multiple remote hosts for the purpose of information processing. Information systems today often process more than a single data source. Systems designed for diverse areas such as online retail, bioinformatics research, and digital libraries rely on the coordination and accessibility of heterogeneous and distributed databases. The disparate data sources may be modeled after, and closely correspond to, the various real-world entities encountered in the domain. As the conceptualization of these real-world entities changes, as in the case of updated scientific vocabularies or business workflow reengineering, the structure of the corresponding data source representations must be modified to reflect the changes. Integrating these heterogeneous data sources to provide a homogeneous interface to information system users, or user groups, currently poses a challenge to designers of such system architectures. Moreover, meeting this challenge would help establish the design requirements of future systems, given the growing trend towards open architectures that support interchange and collaboration between multiple information providers, as currently seen in the emergence of Community-based Systems and in the Semantic Web movement [7].

3. Project Background

Related Work

Since the explosion of the Internet, there has been a proliferation of structured information on the World Wide Web (WWW) and in distributed applications, and a growing need to share that information among businesses, research agencies, scientific communities and the like. Organizing the vast quantities of data into some manageable form, and addressing ways of making it available to others has been the subject of much research.

Research efforts have focused on a variety of problems related to data management and distribution, including: creating more intelligent search engines [24], integrating data from heterogeneous information sources [28][27], and creating public mechanisms for users to share data through metadata descriptions [31][25]. In all of these areas, there has been an effort to employ the semantics of data to produce richer and more flexible access to data. Prior to the advent of XML (the Extensible Markup Language), the problem of data management was addressed in different ways, including the use of artificial intelligence [32], mediators [29][28] and wrappers [31][30][24].

Data mediators and global schema integration

The idea of a mediator is that the schemas for each information source (e.g. database) are integrated in some way to generate a uniform domain model for the user. The mediator then "translates between queries posed in the domain model, and the ontologies of the specific information sources." This, of course, requires the mediator to have knowledge of the description of the contents of the database. Pre-XML solutions relied on the ability to obtain this knowledge directly from database managers [28] or from the application of machine learning methods [32]. A generic database connectivity driver, such as JDBC, allows a database to be queried through a remote connection, and metadata information to be generated.

Wrappers are programs that translate data in the information source to a form that can be processed by the mediator system's query processor. In other words, the wrapper converts human readable data to machine readable data. [27] Among other things, a wrapper can rename objects and attributes, change types and define relationships. Such data translations can now be done with XML by using XML Data Binding techniques. [26][22] (We will say more about this later.)

Creating public mechanisms for making information available to others is the subject of [31]. Mihaila and Raschid propose an architecture that "permits describing, publishing, discovery and access to sources containing typed data." The authors address the issue of discovering and sharing collections of relevant data among organizations in related disciplines (or application domains). This research forecasts the current demands of business, academia and the scientific community, among others, to provide access to an intelligent integration of information in the form of metadata.

XML-based integration

Most solutions described in pre-XML research (prior to 1999) are now largely obsolete, since XML-based applications have solved some of the problems of retrieving and manipulating data in heterogeneous sources that the earlier work addressed. More recently, Michael Carey et al. have capitalized on XML technology by proposing a middleware system [23] that provides a virtual XML view of a database and an XML querying method for defining XML views. Their XPERANTO system "translates XML-based queries into SQL requests, receives and then structures the tabular query results, and finally returns XML documents to the system's users and applications." With XPERANTO, a user can query not only the relational data, but also the relational metadata in the same framework.

Similarly, the Mediation of Information using XML (MIX) [1] approach is motivated by viewing the web as a distributed database and XML as its common data model. Data sources export XML views of their data via DTDs as well as metadata. Queries on the component data sources are made with an XML query language (XMAS). The use of a functional data processing paradigm (XSL and XQuery) currently has limitations in that searching and querying have to be formulated in the XPath syntax, but has the advantage that it can change and access deeply nested recursive data structures easily.

Our approach: Ontology-based data integration

We are investigating in this project how to extract metadata from relational data sources and transform the metadata to XML. The solution to this problem will be the first step in developing an extensible and adaptable architecture to perform integration of heterogeneous data sources into a data warehouse environment using an ontology-based data mediator approach -- which is the final goal of our project.

Ontologies are seen as a key component in the next-generation of data integration and information brokering systems. The DataFoundry approach [3] uses a well-defined API and an ontology model to automatically generate mediators directly from the metadata. The mediator here is implemented as a program component with C++ classes derived from the ontology to perform transformations on the local database into a common data warehouse format.

The work by [4] aims to resolve semantic heterogeneity (i.e. differences or similarities in the meaning of local data) by using ontologies. Hakimpour and Geppert argue that semantic heterogeneity has to be resolved before data integration takes place; otherwise the usage of the integrated data may lead to invalid results. In their approach, databases are 'committed' to a local ontology (derived from local database schema). These different ontologies are merged via a reasoning system (such as PowerLoom), and a new integrated schema is generated. The ontologies are merged by establishing similarity relations between terms in the ontologies. By using the similarity relations discovered, an integrated schema can be obtained by applying rules to derive integrated class definitions and class attributes.

An example of a knowledge modeling tool that uses ontologies is WebODE [43]. This is a web application with a 3-tier architecture that supports ontology design based on the Methontology methodology. Its underlying services are provided via a customized middleware called the Minerva Application Server, which is CORBA-based.

Finally, a good discussion of issues related to information integration with ontologies is given in [44]. It is pointed out that schema-level standards such as XML Schemas and DTDs do not entirely solve the problem of semantic heterogeneity, because the various schemas may not use consistent terminology for schema labels, and the standards do not ensure that data contained in different files that use the schema labels are semantically consistent. A prototype system, the Domain Ontology Management Environment (DOME), is introduced that uses an ontology server to provide translation between source system terminologies and an intermediate terminology. The prototype is implemented as an Enterprise JavaBean.

4. General Plan of Work

Domain Analysis

As a preliminary to our project, we conducted a survey of the usage of metadata and occurrences of metadata interchange within various domains. The domains covered include the business, scientific, medical, and education fields. We present our findings below:


Business

Metadata management offers several benefits in the business domain, including:

  • Simplified integration of heterogeneous systems

  • Increased interoperability between applications, tools, services

  • Greater reuse of modules, systems, data

  • An enabler for a services-based architecture

  • Common models needed for sharing services

One of the most important business metadata standards is the Electronic Business XML Initiative (ebXML) [45], jointly developed by UN/CEFACT and OASIS. ebXML offers companies an alternative to Electronic Data Interchange (EDI) systems, which often require the implementation of custom protocols and proprietary message formats between the individual companies. Because of this, EDI use has been restricted to larger corporations that can absorb the initial costs required to do business in this fashion. The goal of ebXML is to provide a flexible, open infrastructure that will let companies of any size, anywhere in the world, do business together.

Digital Library

One consequence of a wide range of communities having an interest in metadata is that there are a bewildering number of standards and formats in existence or under development. The library world, for example, has developed the MARC (MAchine-Readable Cataloging) formats as a means of encoding metadata defined in cataloguing rules and has also defined descriptive standards in the International Standard Bibliographic Description (ISBD) series. Metadata is not only used for resource description and discovery purposes. It can also be used to record any intellectual property rights vested in resources and to help manage user access to them. Other metadata might be technical in nature, documenting how resources relate to particular software and hardware environments or for recording digitization parameters. The creation and maintenance of metadata is also seen as an important factor in the long-term preservation management of digital resources and for helping to preserve the context and authenticity of resources.

The Dublin Core

Perhaps the best-known metadata initiative is the Dublin Core (DC). The Dublin Core defines fifteen metadata elements for simple resource discovery: title, creator, subject and keywords, description, publisher, contributor, date, resource type, format, resource identifier, source, language, relation, coverage and rights management. One of the specific purposes of DC is to support cross-domain resource discovery; i.e. to serve as an intermediary between the numerous community-specific formats being developed. It has already been used in this way in the service developed by the EU-funded EULER project and by the UK Arts and Humanities Data Service (AHDS) catalogue. The Dublin Core element set is also used by a number of Internet subject gateway services and in services that broker access to multiple gateways, e.g. the broker service being developed by the EU-funded Renardus project.
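
As a rough illustration of how the fifteen elements might be serialized, the sketch below writes a simple Dublin Core record as XML. The element names (`title`, `creator`, `subject`, `type`, `identifier`, `rights`, and so on) and the namespace URI are those of the Dublin Core Metadata Element Set, version 1.1; the class name, the wrapper element `record`, and the sample values are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a simple Dublin Core record serialized as XML.
class DublinCoreRecord {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // The fifteen element names of the Dublin Core Metadata Element Set 1.1.
    static final List<String> ELEMENTS = List.of(
            "title", "creator", "subject", "description", "publisher",
            "contributor", "date", "type", "format", "identifier",
            "source", "language", "relation", "coverage", "rights");

    private final Map<String, String> values = new LinkedHashMap<>();

    // Set one element; names outside the fifteen-element set are rejected.
    public DublinCoreRecord set(String element, String value) {
        if (!ELEMENTS.contains(element)) {
            throw new IllegalArgumentException("Not a Dublin Core element: " + element);
        }
        values.put(element, value);
        return this;
    }

    // Write the record with one dc:-prefixed child per element that was set.
    public String toXml() {
        StringBuilder xml = new StringBuilder();
        xml.append("<record xmlns:dc=\"").append(DC_NS).append("\">\n");
        for (Map.Entry<String, String> entry : values.entrySet()) {
            xml.append("  <dc:").append(entry.getKey()).append('>')
               .append(escape(entry.getValue()))
               .append("</dc:").append(entry.getKey()).append(">\n");
        }
        return xml.append("</record>\n").toString();
    }

    // Escape the characters that are not allowed in XML element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.print(new DublinCoreRecord()
                .set("title", "Metadata Management for Heterogeneous Databases")
                .set("language", "en")
                .set("type", "Text")
                .toXml());
    }
}
```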

Scientific and Medical

In the area of scientific research, data is exchanged between organizations to collect raw data sets for testing and analysis. To support interoperability and provide better access, several metadata standardization projects have been initiated. One example of a government-driven metadata initiative is the Federal Geographic Data Committee (FGDC) [11], tasked with developing procedures and assisting in the implementation of a distributed discovery mechanism for digital geospatial data. Its core Content Standard for Digital Geospatial Metadata (CSDGM) has been extended to meet the needs of specific groups that use geospatial data, including working groups in biology, shoreline studies, remote sensing, and cultural and demographics surveying.

The Unified Medical Language System (UMLS) project [8] directed by the National Library of Medicine aims to aid the development of systems that help health professionals and researchers retrieve and integrate electronic biomedical information from a variety of sources and to make it easy for users to link disparate information systems, including computer-based patient records, bibliographic databases, factual databases, and expert systems. The UMLS project develops "Knowledge Sources" (consisting of a Metathesaurus, a SPECIALIST lexicon, and a UMLS Semantic Network) that can be used by a wide variety of applications programs to overcome retrieval problems caused by differences in terminology and the scattering of relevant information across many databases.

Two noteworthy non-governmental metadata projects related to healthcare are the Clinical Data Interchange Standards Consortium (CDISC) [10] and Health Level Seven (HL7) [9] standards. CDISC aims to develop an XML-based metadata model to support standard data interchange between medical and biopharmaceutical companies, such as transferring clinical trial case reports or data captured via an electronic data collection (EDC) application into an operational database, from which the data are gleaned for analysis and regulatory submission. This would allow regulatory reviewers, such as the FDA, to more easily view and replicate the submitted analyses.

HL7 represents an effort to define an Electronic Patient Record (EPR) standard for the healthcare industry. In the document-oriented patient record, whether computer- or paper-based, the patient's medical record is represented as a collection of documents. An EPR is a single document that can be used to generate multiple views for a patient’s care life-cycle, ranging from epidemiology reports to insurance and billing claims. EPRs are also seen as the central component for Clinical Data Warehousing [20]. Hence, integrating data from different EPR systems is seen as an important challenge.

There are numerous other interchange standards based on XML including MathML [12], the Chemical Markup Language (CML), the Bioinformatics Sequence Markup Language (BSML) [18], and the Extensible Scientific Interchange Language (XSIL) [19]. In general, these are predicated upon the use of common metadata standards for describing objects, properties, and relationships in the specialized scientific domain.


Education

Online education is an area where standards are increasingly important. Evidence of this can be seen in the number of groups working on standards for describing and sharing educational resources in an online environment. Some of these include: the Aviation Industry CBT Committee (AICC), the European CEN/ISSS Learning Technologies group (CEN/ISSS LT), the Education working group of the Dublin Core Metadata Initiative (DC Education), the IEEE Learning Technology Standards Committee (IEEE LTSC), the Instructional Management Systems project (IMS) Global Learning Consortium, and EdNA (Education Network Australia) [35]. These groups are all involved in creating standards for interoperability, integration, and the use of the semantic web. This discussion will treat two of these organizations: IMS and IEEE LTSC.

Influencing the standards development of many, if not all, of these groups is the Dublin Core Metadata Initiative (DCMI) [37]: “an international collaborative effort to establish and maintain standards for describing Internet resources with the aims of enabling targeted resource discovery and interoperability of information exchange.” [38] The DCMI defines 15 standard data elements which provide a common core of semantics for resource description. In addition, tools and software are available through DCMI for creating metadata, automatic extraction/gathering of metadata and conversion between metadata formats.

The IMS consortium [34] is involved in a broad scope of work related to developing standards for “repository technology to support the configuration, presentation, and delivery of learning objects” and in the “integration of e-learning with existing and emerging online digital asset services.” Standards are oriented towards the training market and industry. Stakeholders come from Higher Education, K-12 schools and training organizations.[35] The IMS Learning Object Metadata Working Group has developed a standards model derived from Dublin Core.

The IEEE Computer Society's Learning Technology Standards Committee (IEEE LTSC) was chartered to develop standards to facilitate “interoperation of computer implementations of education and training components and systems.” [39] Currently the LTSC is composed of several working groups, including a group focused on Data and Metadata. This working group is further decomposed into 3 primary interest groups:

  • Learning Objects Metadata (standards regarding minimal set of attributes needed for locating, managing and evaluating learning objects)

  • Semantics and Exchange Bindings (investigations regarding use of XML and DTDs)

  • Data Exchange Protocols Localization (standards regarding translations and cultural issues)

The Learning Objects Metadata standard "will support security, privacy, commerce, and evaluation," but will not address the implementation of these features [40]. The Semantics and Exchange Bindings group began ad-hoc in 1998 to study XML as an emerging internet technology and to investigate its potential relevance to other working groups. They have just released standards for Rule-based Binding Techniques -- techniques for rule-based XML coding bindings for data models. [41] The standard for Data Exchange Protocols addresses data exchange at a finer granularity than HTTP. It defines a protocol and semantics that can easily be implemented in networking applications and can easily be bound to APIs. [36]

This treatment of two education standards organizations demonstrates that there are other applications involving semantic mappings from local data sources to external data standards. We feel that E-Learning is an application area in which our work will fit nicely.

Design of activities to be undertaken

Our initial design for the proposed system is depicted in Figure 1. There are four major components of this design to be implemented:

  1. Obtaining the metadata from the data source – the local data source exports its metadata to the service. A client-side process first binds the object representation of the data source description into a standard metadata format in XML.

  2. Providing the user interface – via the Metadata Viewer component of the service, a UI allows the knowledge engineer to view the metadata obtained from their local data source, and to select and view one or more external standards. The user dynamically specifies the desired mapping from attributes of the local schema to the properties defined for the foreign schema, and this specification is stored by the Schema Merge component.

  3. Defining the ontology – the ontology is used to optimize the integration between schemas by providing a common semantic reference. Every model in the ontology is associated with a schema in XML format. A privileged user of the metadata manager can create, import, delete and modify the ontology models and their associated schemas from an editor.

  4. Providing the schema transformations – based on the mappings specified by the user and the associations discovered from the ontology, the Schema Translate component of the service generates the rules for transforming the local data source into the selected external standards. The mapping specification and transformation rules are returned to the user in the form of an Extensible Stylesheet Language (XSL) document, which can be used by applications on the client side.

Figure 1. An ontology-based metadata management system for heterogeneous data sources
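
The last component returns a stylesheet that client applications can run themselves. A minimal sketch of that client-side step, using the standard `javax.xml.transform` (JAXP) API, might look like the following. The inline stylesheet, the element names (`patient`, `pat_name`, `Patient`, `name`) and the class name `ApplyMapping` are assumptions made for illustration; the actual output of the Schema Translate component is not specified here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical client-side use of a stylesheet returned by the service.
class ApplyMapping {
    // Assumed stylesheet: maps the local element <pat_name> onto <Patient><name>.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"/patient\">"
          + "<Patient><name><xsl:value-of select=\"pat_name\"/></name></Patient>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Apply an XSLT stylesheet to an XML document and return the result as text.
    public static String transform(String xml, String stylesheet) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String local = "<patient><pat_name>Ada Lovelace</pat_name></patient>";
        System.out.println(transform(local, STYLESHEET));
    }
}
```

Returning the rules as a stylesheet keeps the transformation on the client, so instance data never has to be sent to the service.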

Technologies and Tools

Realizing a project like this is simplified by a careful selection of tools and technologies to be used in the system implementation. Based on our preliminary design, we have chosen to use the power and flexibility of the Java programming language, the Extensible Markup Language (XML), and the Enterprise JavaBeans (EJB) component model as the foundational infrastructure for our project.

Java Development Environment

We will use Java as the development language for our project. We chose Java over C++ for several reasons [43].

  • Java is available on most platforms, and a JVM can be downloaded for the developer's platform. C++ is not as portable as Java.

  • There are more tools available in Java than in any other language.

  • Java is best suited for writing components. There are more XML components (parsers, XSL processors, conversion, etc.) written in Java than in any other language. In combination with XML, Java is particularly relevant for server-side applications.

  • Java has an extensive library, including:

    • java.awt for graphical user interface development

    • java.beans for Java components services

    • java.sql for accessing SQL databases through an interface similar to ODBC

    • javax.servlet for creating servlets

For this project, we intend to use JDBC™, Sun's standard API for connecting to relational databases from Java. In particular, we will use the metadata portion of the JDBC API (the specifics of this will be discussed in the next section). JDBC is an acronym for Java Database Connectivity. Because the JDBC API enables Java programs to execute SQL statements, the program can interact with any SQL-compliant database. Since Java runs on most platforms, and since most relational databases support SQL, it is possible to write a Java application that can interact with heterogeneous database systems. We did not choose ODBC (Microsoft's standard database access method) because it is language dependent. [42]

We are investigating three Open Source products for converting JDBC metadata into XML format: JSX, Jato and Castor.

Java Serialization to XML (JSX) aims to provide a simple and lightweight mechanism for compact serialization of object data that uses only a single method invocation to take in an object and write out its contents as XML (and vice versa). Java objects are serialized as XML elements, and object fields as attributes. Because of its specific purpose, JSX does not require the sophistication of SAX or DOM. It is simpler to use, and its memory footprint is sufficiently small for use in applets.

Jato is an open-source Java API and XML language for transforming XML documents into a set of Java objects and back again. Jato scripts describe the operations to perform and leave the algorithms for implementing the operations to an interpreter. A Jato script expresses the relationships between XML elements and Java objects, freeing the developer from writing iteration loops, recursive routines, error-checking code, and many other error-prone, verbose, and monotonous XML parsing chores.

Castor is an open source data binding framework for Java. It is described as “basically the shortest path between Java objects, XML documents, SQL tables and LDAP directories. Castor provides Java to XML binding, Java to SQL/LDAP persistence, and then some more.” Castor will translate either a DTD or an XML Schema.
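
To make the idea of object-to-XML binding concrete, the sketch below uses the Java Reflection API to write an object as one XML element with its fields as attributes, which is the general shape described for JSX above. It is not the JSX, Jato, or Castor API. The class names `ReflectiveXmlWriter` and `ColumnInfo` are hypothetical, and only flat objects whose fields are primitives or strings are handled.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical sketch of object-to-XML binding with the Reflection API.
class ReflectiveXmlWriter {

    // Write the object as a single element: class name as the tag,
    // each non-static field as an attribute.
    public static String toXml(Object obj) throws IllegalAccessException {
        Class<?> type = obj.getClass();
        StringBuilder xml = new StringBuilder("<").append(type.getSimpleName());
        for (Field field : type.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // skip constants and compiler-generated fields
            }
            field.setAccessible(true); // allow reading private fields
            xml.append(' ').append(field.getName()).append("=\"")
               .append(escape(String.valueOf(field.get(obj)))).append('"');
        }
        return xml.append("/>").toString();
    }

    // Escape the characters that are not allowed in an XML attribute value.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;")
                   .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Assumed example type: a description of one database column.
    static class ColumnInfo {
        private final String name;
        private final String sqlType;
        private final boolean nullable;

        ColumnInfo(String name, String sqlType, boolean nullable) {
            this.name = name;
            this.sqlType = sqlType;
            this.nullable = nullable;
        }
    }

    public static void main(String[] args) throws IllegalAccessException {
        System.out.println(toXml(new ColumnInfo("patient_id", "INTEGER", false)));
    }
}
```

The reflection API does not guarantee the order in which fields are returned, so the attribute order in the output should not be relied upon.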

The Enterprise JavaBean Component Model

The middleware technology that we have chosen for the project is EJB (Enterprise JavaBeans). EJB is a server component model for Java and is a specification for creating server-side, scalable, transactional, multi-user, and secure enterprise-level applications. Most importantly, EJBs can be deployed on top of existing transaction processing systems including traditional transaction processing monitors, Web servers, database servers, application servers, and so forth.

In an n-tier architecture, it does not matter where the business logic is, though in a typical 3-tier architecture the business logic is normally in the middle tier by convention. With EJB, however, we can move our business logic wherever we want, adding tiers if necessary. The EJBs containing the business logic are platform-independent and can be moved to a different, more scalable platform should the need arise. A major highlight of the EJB specification is the support for ready-made components. This enables developers to "plug and work" with off-the-shelf EJBs without having to develop or test them or to have any knowledge of their inner workings.

The Enterprise JavaBean component is the Java class (or classes) that represents the business-logic component. There are two types of EJB. The first type is the Session Bean, which represents a process that will be performed on the server. Since the client will request a service from a session bean, each client will have its own instance of the bean; instances of session beans cannot be shared among multiple clients. Session beans can be separated into two types: stateless and stateful. The second type is the Entity Bean, which maps a Java class to a data source. The source could be a single row in a database, an entire table, or some form of legacy data not represented in a database. Each entity bean has a primary key associated with it that identifies the data within. It would be difficult to control changes to multiple copies of the same data, so only one instance of an entity bean exists for any given primary key in a system (even in a distributed system). Entity beans can be separated into two types: bean-managed and container-managed. These types refer to the way the data held in the bean is transferred to the underlying persistent storage. For this project we will be using container-managed entity beans.

The EJB-based three-tier programming model views a Web browser as the first tier, an application-enabled Web server as the second tier, and enterprise information resources as the third tier. In addition to EJB technology, Java servlet technology, JavaBeans technology, and JavaServer Pages (JSP) technology are also used in this programming model. In this model, the following responsibilities are assigned to the participating Java components:

  • Java servlets are assigned the role of application "controller"

  • JSP pages handle presentation of data and user interface tasks

  • EJB components provide the mechanism for accessing enterprise information resources

A three-tier design based on EJBs confers several benefits, including:

  • Business logic accessing enterprise data can be encapsulated in reusable, portable enterprise beans.

  • Existing enterprise systems can be integrated as enterprise beans with little or no modification.

  • Run-time services required for enterprise applications, such as transactions and persistence, can be factored out of beans and assigned to the bean container.

  • Servlets that control application flow can be modified without requiring change to EJB components.

  • Servlet code can focus on application control logic without regard to presentation of data.

  • JSP pages can generate presentation information mixing static and dynamic content.

  • System components written in the Java language are portable to any platform with a JVM.

Methods and Techniques

Metadata Extraction and Object-to-XML-binding

Our approach to metadata extraction from the data source will have two phases. First, a JDBC connection to the data source is made, and the DatabaseMetaData and ResultSetMetaData interfaces in the java.sql package are used to extract the metadata from the databases as class objects. According to the class documentation, DatabaseMetaData "provides information about the database as a whole." ResultSetMetaData is used to inspect what kind of information was returned by a database query or by a method of DatabaseMetaData.
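The first phase might look roughly like the following sketch, which assumes an already opened JDBC `Connection`. The class name `MetadataExtractor`, its method names, and the output format are hypothetical; only the `java.sql` calls are part of the standard API.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the proposal's actual code base.
class MetadataExtractor {

    // Formats one column description, e.g. "CUSTOMER.NAME VARCHAR NOT NULL".
    static String describe(String table, String column, String typeName, boolean nullable) {
        return table + "." + column + " " + typeName + (nullable ? " NULL" : " NOT NULL");
    }

    // Lists every column of every table visible through the connection.
    static List<String> listColumns(Connection conn) throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        List<String> out = new ArrayList<>();
        // A null catalog/schema and the "%" patterns mean "match everything".
        try (ResultSet cols = meta.getColumns(null, null, "%", "%")) {
            while (cols.next()) {
                // Read the columns in result-set order for driver portability.
                String table = cols.getString("TABLE_NAME");
                String column = cols.getString("COLUMN_NAME");
                String type = cols.getString("TYPE_NAME");
                boolean nullable = cols.getInt("NULLABLE") != DatabaseMetaData.columnNoNulls;
                out.add(describe(table, column, type, nullable));
            }
        }
        return out;
    }
}
```

A caller would obtain the connection from `DriverManager.getConnection(...)` with whatever URL and driver the data source requires, then pass it to `listColumns`.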

The next phase will use data binding to translate the Java objects returned by DatabaseMetaData into XML Schemas. XML data binding is the translation of XML documents to objects (unmarshalling) and back again (marshalling). Numerous products are available for this purpose. However, these products typically start from XML and produce objects, whereas we must translate from objects to XML. It is not yet clear whether one of the products we are interested in will be useful in this endeavor.

A preliminary review suggests that Castor may be the best product to use. Our aim is to produce XML Schemas rather than XML DTDs in this process. DTDs cannot assign data types to element content, which is treated simply as text, and they can express repetition only as zero-or-one, zero-or-more, or one-or-more, not as exact occurrence counts. Schemas correct these and other deficiencies [21].
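To make the object-to-schema direction concrete, here is a minimal sketch that builds an XML Schema document for one table by hand with the standard DOM API. It does not use Castor or any other binding product. The class name `SchemaBuilder`, the small SQL-to-XSD type table, and the choice of one element per column are all assumptions made for illustration.

```java
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: turns column metadata into a typed XML Schema.
class SchemaBuilder {
    static final String XS = "http://www.w3.org/2001/XMLSchema";

    // Maps a few SQL type names to XML Schema built-in types (illustrative only).
    static String xsdType(String sqlType) {
        switch (sqlType.toUpperCase()) {
            case "INTEGER": return "xs:int";
            case "DECIMAL": return "xs:decimal";
            case "DATE":    return "xs:date";
            default:        return "xs:string";
        }
    }

    // Builds xs:schema > xs:element(table) > xs:complexType > xs:sequence > xs:element(column)*.
    static Document tableToSchema(String table, Map<String, String> columns) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element schema = doc.createElementNS(XS, "xs:schema");
        doc.appendChild(schema);
        Element row = doc.createElementNS(XS, "xs:element");
        row.setAttribute("name", table);
        schema.appendChild(row);
        Element type = doc.createElementNS(XS, "xs:complexType");
        row.appendChild(type);
        Element sequence = doc.createElementNS(XS, "xs:sequence");
        type.appendChild(sequence);
        // One typed child element per column (key = column name, value = SQL type).
        for (Map.Entry<String, String> column : columns.entrySet()) {
            Element el = doc.createElementNS(XS, "xs:element");
            el.setAttribute("name", column.getKey());
            el.setAttribute("type", xsdType(column.getValue()));
            sequence.appendChild(el);
        }
        return doc;
    }
}
```

The typed `type` attributes are exactly what a DTD could not express; the resulting `Document` can be serialized with a `javax.xml.transform.Transformer` if a text form is needed.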

XML and Ontologies

To exchange information efficiently, database administrators and knowledge engineers on local systems have to provide a migration path from their local data schema to an external industrial interchange standard. Currently, many of these standards are specified using XML Document Type Definitions (DTDs) or XML Schemas [13]. We will look at how local source metadata can be imported into a Resource Description Framework Schema (RDFS) [14] format that identifies its schema attributes and constraints. A common ontology framework is used to model, view, and maintain domain-specific concepts. The ontology also models the mapping relationships between entities in the local schema and the external exchange standards (foreign schemas). Based on the mapping information in the ontology model, we generate transformation rules that indicate how each attribute in the local schema should be migrated to the semantically corresponding property in the external standard schema. Because the local schema and the external standard both have XML representations, the transformation rules can be encoded as XSL Transformations (XSLT) [15]. Note that a one-to-many mapping is also possible, since an attribute of the local schema can correspond to elements from several standard schemas. We can use the XML Namespaces mechanism to keep the correspondences unique.
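As a rough sketch of how mapping information could be turned into an XSLT rule and applied, the following uses only the standard `javax.xml.transform` API. The class name `MappingToXslt`, the namespace URI `urn:example:external-standard`, the `ext` prefix, and the element names in the usage note are hypothetical; a real mapping would come from the ontology model rather than a hand-filled map.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: generates and applies a stylesheet from attribute mappings.
class MappingToXslt {
    // Placeholder namespace standing in for an external exchange standard.
    static final String EXT_NS = "urn:example:external-standard";

    // mapping: local attribute name -> property name in the external standard.
    static String buildStylesheet(String localRoot, String externalRoot, Map<String, String> mapping) {
        StringBuilder sb = new StringBuilder();
        sb.append("<xsl:stylesheet version=\"1.0\"")
          .append(" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"")
          .append(" xmlns:ext=\"").append(EXT_NS).append("\">");
        sb.append("<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>");
        sb.append("<xsl:template match=\"/").append(localRoot).append("\">");
        sb.append("<ext:").append(externalRoot).append(">");
        // One rule per mapping: copy the local value into the external property.
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            sb.append("<ext:").append(e.getValue()).append(">")
              .append("<xsl:value-of select=\"").append(e.getKey()).append("\"/>")
              .append("</ext:").append(e.getValue()).append(">");
        }
        sb.append("</ext:").append(externalRoot).append(">");
        sb.append("</xsl:template></xsl:stylesheet>");
        return sb.toString();
    }

    // Applies the stylesheet to a local XML instance and returns the result as text.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, mapping `cust_name` to `Name` with roots `customer` and `Party` rewrites a local `<customer>` record into an `ext:Party` element whose `ext:Name` child carries the original value. Qualifying the output with a namespace prefix is what keeps properties from different standards distinct when one local attribute maps to several of them.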

EJB and Security

An additional advantage of using EJB is the security features it provides [46]. Much of EJB security is concerned with authorization. EJB authorization is based on a simplified CORBA security model, which asks whether an authenticated principal (or group of principals) is authorized to invoke a method accessible via the ORB. EJB security also concerns the process of deploying an application so that it can be secure. As such, EJB authorization is expressed from the perspective of each EJB security role.
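The question "may this principal invoke this method?" can be illustrated with a plain-Java sketch of a role-to-method permission table. This is not the EJB security API; in EJB such permissions are declared at deployment time rather than coded by hand. The class `MethodPermissions` and the role and method names in the usage note are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical permission table: method name -> roles allowed to invoke it.
class MethodPermissions {
    private final Map<String, Set<String>> allowed = new HashMap<>();

    // Grants the listed roles permission to invoke the method.
    void permit(String method, String... roles) {
        allowed.put(method, new HashSet<>(Arrays.asList(roles)));
    }

    // True if any of the caller's roles is permitted; unknown methods are denied.
    boolean mayInvoke(Set<String> callerRoles, String method) {
        Set<String> roles = allowed.get(method);
        if (roles == null) {
            return false;
        }
        for (String role : callerRoles) {
            if (roles.contains(role)) {
                return true;
            }
        }
        return false;
    }
}
```

With `permit("updateMapping", "curator", "admin")`, a caller holding only a `guest` role would be refused `updateMapping`, while any method with no entry at all is denied by default.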

Time Table for Project Completion¹

Estimating the time to complete this project will be a guess, at best. Since we have no prior applications of a similar nature to use as a guide, we will have to rely on knowledge of our general programming skills to estimate our time to completion.

There are two approaches to development: traditional OOA&D, with emphasis on design up front, and Extreme Programming (XP), with emphasis on "design as you go." Since our time is rather constrained, we are leaning towards the XP approach for rapid development of a prototype. The XP philosophy on Estimation is to (1) keep it simple, (2) use what happened in the past and (3) learn from experience. We will be able to "keep it simple," but we aren't able to meet the other two criteria.

The XP approach uses what traditionally is called the “bottom-up” approach to estimation. Individual components of the project are estimated instead of the entire project. The OO paradigm makes this easier, since the project is usually decomposed into interacting class objects.

In this case, we would estimate the size of a story. However, story estimation is based on the actual time spent implementing similar stories in other projects. Hence, we’re back to the same problem. One philosophy in XP is that, periodically, every story will be re-estimated, giving us a chance to incorporate changes that we have encountered (such as technologies that turned out to be difficult). Therefore, having no historical data to use, we will record our time during the first iteration and use that as the basis for our subsequent estimates.

Our approach will be “If this story is all I had to do, and I had nothing else to do, how long would I expect it to take?” Units of estimation may be weeks, days or even hours (minutes are unlikely).

We will probably start with 2-week iterations. An iteration is the telling of “yesterday’s weather.” At the end of each iteration, we measure how much we got done, and then assume we’ll get the same amount of work done in the next iteration. Hence, an iteration is a chart of progress as well as an estimating tool. Time will be recorded during an iteration in terms of “ideal” time: time without interruption during which we can concentrate on our task. Ideal time is time spent on tasks for which one has personal responsibility. For instance, it does not include time spent in pair programming. The process of measuring time spent on an iteration is as follows:

  • At the end of the iteration, we record how many days (weeks) of ideal time each story required.

  • We add up the ideal time in all the stories.
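The bookkeeping described above amounts to a simple sum, sketched below. The class name `IterationTracker` and the story titles in the usage note are hypothetical, and ideal days are used as the unit only for the sake of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical tracker for one iteration: story title -> ideal days actually spent.
class IterationTracker {
    private final Map<String, Double> idealDays = new LinkedHashMap<>();

    // Records time against a story, accumulating if the story was already logged.
    void record(String story, double days) {
        idealDays.merge(story, days, Double::sum);
    }

    // Total ideal time completed; by "yesterday's weather" this is also
    // the amount of work to plan for the next iteration.
    double velocity() {
        double total = 0;
        for (double days : idealDays.values()) {
            total += days;
        }
        return total;
    }
}
```

If an iteration logs 3.5 ideal days on one story and 2.5 on another, `velocity()` returns 6.0, and stories estimated at about six ideal days in total would be scheduled for the following iteration.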

Our first job will be to create and order the stories. In the beginning, we do not need to consider dependencies (e.g., producing the GUI for the application). The first stories that we do should be those which will yield the most “business value” to the customer (stakeholder). We have no customer (except, perhaps, Dr. Lee, who is not likely to be interacting with us on a constant basis!), so we’ll probably entrust customer status to Quddus Chong. He will be responsible for writing the stories.

Our first task will focus on infrastructure:

  • get the testing framework working

  • get the automated build structure working

  • get the appropriate permissions set up on the network

  • get the appropriate software installed and running

We know that our final release date is around May 10. We may have smaller releases before then, based on what our “customer” asks for. Quddus will decide which stories to place in a release and which stories to defer to a later release. We will track our iterations and releases using index cards. Each card will have the following information:
Story | Time Estimate (ideal weeks) | Assigned Iteration # | Assigned Release #
Assuming we do two releases, a best guess as to our completion dates will be:
Event        Date
Start        22 Feb 02
Release 1    31 Mar 02
Release 2    10 May 02

This, of course, may be revised when our customer confers with us about desired releases. The release dates are the customer’s decision. Iteration dates are the programmer team’s decision.

Throughout our development process, we will practice XP principles. Our first iteration planning meeting will take place on Feb. 22, 2002.

[1] C. Baru, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, V. Vianu. “Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation”. Online (Available:, 1999.
[2] A. Bouguettaya, B. Benatallah, A. Elmagarmid. “Interconnecting Heterogeneous Information Systems”. Kluwer Academic Publishers, Boston, 1998.
[3] T. Critchlow, M. Ganesh, R. Musick. “Meta-data Based Mediator Generation”. Conference on Cooperative Information Systems (page 168-176). Online (Available:, 1998.
[4] F. Hakimpour and A. Geppert. “Resolving Semantic Heterogeneity in Schema Integration: an Ontology Based Approach”. In Proceedings of ACM International Conference on Formal Ontology In Information Systems (FOIS-2001). Online (Available:, 2001.
[5] S. Madnick. “Metadata Jones and the Tower of Babel: The challenge of Large-Scale Semantic Heterogeneity”. In Proceedings of IEEE Meta-Data Conference, 1999.
[6] S. Ram. “Guest Editor’s Introduction: Heterogeneous Distributed Database Systems”. Special Issue on Heterogeneous Distributed Database Systems, volume 24:12 of Computer. IEEE Computer Society Press, December 1991.

[7] W3C Semantic Web WWW page. (Available: Current as of February 14, 2002.

[8] NLM Unified Medical Language System (UMLS) WWW page. (Available: Current as of February 13, 2002.
[9] Health Level 7 (HL7) WWW page. (Available: Current as of February 18, 2002.
[10] Clinical Data Interchange Standards Consortium (CDISC) WWW page. (Available: Current as of February 7, 2002.
[11] Federal Geospatial Data Committee Clearinghouse (FGDC) WWW page. (Available: Current as of May 17, 2001.
[12] W3C MathML WWW page. (Available: Current as of January 2, 2002.
[13] W3C XML Schema WWW page. (Available: Current as of January 7, 2002.
[14] D. Brickley and R.V. Guha. (ed.). W3C Candidate Recommendation Resource Description Framework (RDF) Schema Specification 1.0. Online (Available: March 27, 2000.
[15] W3C Extensible Stylesheet Language (XSL) WWW page. (Available: Current as of January 31, 2002.

[16] Chemical Markup Language (CML) WWW page. (Available: Current as of July 22, 2001.

[17] T. Bray, D. Hollander, and A. Layman (ed.). W3C Recommendation Namespaces in XML. Online (Available: January 14, 1999.
[18] Bioinformatics Sequence Markup Language (BSML) WWW page. (Available: Current as of February 18, 2002.
[19] Extensible Scientific Interchange Language (XSIL) WWW page. (Available: Current as of February 18, 2002.
[20] T.B. Pedersen and C.S. Jensen. “Research Issues in Clinical Data Warehousing”. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, 1998.
[21] J. Bosak, T. Bray, D. Connolly, E. Maler, G. Nicol, C.M. Sperberg-McQueen, "W3C XML Specification DTD,"

[22] S.Brodkin, "Use XML Data Binding to Do Your Laundry", JavaWorld, Dec. 2001,

[23] M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, S. Subramanian, "XPERANTO: Publishing Object-Relational Data as XML", WebDB (Informal Proceedings), pp. 105-110, 2000. url:

[24] L.M. Haas, R. J. Miller, B. Niswonger , M. T. Roth, P.M. Schwarz and E.L. Wimmers, "Transforming Heterogeneous Data with Database Middleware: Beyond Integration", IEEE Data Engineering Bulletin, vol.22, num.1, pp31-36, 1999.

[25] "Introduction to UDDI", XML Web Services Resources, June 2001

[26] JSR 31 XML Data Binding Specification,
[27] A.Levy, "The Information Manifold Approach to Data Integration", IEEE Intelligent Systems, 1312-16, 1998.
[28] A. Y. Levy, J. J. Ordille, "An Experiment in Integrating Internet Information Sources”, in AAAI Fall Symposium on AI Applications on Knowledge Navigation and Retrieval, Cambridge, MA, November 1995.
[29] A.Levy, A.Rajaraman, J.Ordille, "Querying Heterogeneous Information Sources Using Source Descriptions", Proceedings of the Twenty-second International Conference on Very Large Databases, VLDB Endowment, Saratoga, Calif., Bombay, India, pp251-262,1996.
[30] W.May, R. Himmeroder, G. Lausen, B. Ludascher, "A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web",
[31] G.A.Mihaila, L.Raschid, "Locating Data Repositories Using XML",
[32] M.Perkowitz, O.Etzioni, "Category Translation: Learning to Understand Information on the Internet". In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments. American Association for Artificial Intelligence, 1995.

[33] B.Spell, "Enhancing Database Code With Metadata", JAVAPro, June 1999,

[34] "About IMS",
[35] P. Bacsich, A. Heath, P. Lefrere, P. Miller, “The Standards for Online Education” , D-Lib Magazine, vol.5, no.12, Dec. 1999
[36] Data Exchange Protocols Working Group, IEEE Learning Technology Standards Committee,
[37] Dublin Core Metadata Initiative,
[38] EdNA Metadata Homepage,
[39] IEEE Learning Technology Standards Committee,
[40] Learning Objects Metadata Working Group, IEEE Learning Technology Standards Committee,
[41] Semantics and Exchange Bindings, IEEE Learning Technology Standards Committee,
[42] B. Marchal, XML by Example, Que Publishing, 2000.
[43] WebODE WWW page. Available: Current as of June 6, 2001.

[44] Z. Cui, D. Jones, and P. O’Brien. “Issues in Ontology-based Information Integration”. Online (Available: 2001.

[45] ebXML WWW page. Available: Current as of February 7, 2002.
[46] L. Koved, A. Nadalin, N. Nagaratnam, M. Pistoia, and T. Shrader. “Security Challenges for Enterprise Java in an E-business Environment”. IBM Systems Journal. Volume 40, Number 1, 2001.
[47] V. Kashyap, A. Sheth. “Information Brokering Across Heterogeneous Digital Data: A Metadata-based Approach”. Kluwer Academic Publishing. Boston. 2000.

1 All references to XP principles are from: Kent Beck, Martin Fowler, Planning Extreme Programming, Addison Wesley, 2001.
