Foundations for Transparent Data Exchange and Integration

Alan Kaplan
Department of Computer Science
Clemson University
Box 341906
Clemson, SC 29634-1906 USA
kaplan@cs.clemson.edu
Jack C. Wileden
Department of Computer Science
University of Massachusetts
Box 34610
Amherst, MA 01003-4610 USA
wileden@cs.umass.edu
The Conference on Scientific and Technical Data Exchange and Integration
December 15-17, 1997
Natcher Conference Center
National Institutes of Health
Bethesda, MD USA

Abstract

Contemporary approaches to data exchange and integration, such as standardized data interchange formats, IDL-mediated data exchange mechanisms or ad hoc wrapping of legacy data repositories, all suffer from a lack of transparency. That is, these approaches all require that users be painfully aware of boundaries between various data repositories and the applications that need to access them. Our research program is directed toward developing the computer science foundations required for transparent data exchange and integration. Major facets of this research program include formal models of type compatibility, type safety, and name management; a new, highly transparent approach to data exchange and integration called PolySPIN; and a collection of prototype automated tools that support the PolySPIN approach. Here we summarize the initial results of this research program, outline its current and future directions, and explore its implications for current and future approaches to data exchange and integration.

1. Introduction

Scientific software systems (such as computer-aided design/computer-aided modeling applications, geographic information systems, earth observation systems, and bioinformatic applications) typically create, use and manage large amounts of relatively complex data. Developers and users of such data, however, are faced with a difficult tradeoff. They need to be able to model data so that it can be easily understood and efficiently used by an individual scientific application; at the same time they need the data to be in a form such that it can be used, integrated, and shared by other scientific applications. Scientific applications traditionally have overcome this problem by resorting to relatively low-level techniques, such as using standard data interchange formats, IDL-mediated data exchange mechanisms or ad hoc wrapping of legacy data repositories. Such approaches tend to have various shortcomings. For example, standardized interchange formats typically require explicit translations, which are often inefficient and prone to error. IDL-mediated mechanisms, such as CORBA and DCOM, impose foreign data models that represent a least common denominator type system [KRW97]. Another serious drawback associated with such approaches is that their use generally requires that developers and users be aware of the boundaries between various data repositories and the applications that need to access them. As a result, software based on these approaches that needs to access and manipulate scientific data is difficult to develop and maintain.

Our research is directed toward developing computer science foundations for transparent data exchange and integration, in particular, exploring both theoretical and practical aspects of this problem domain. The primary objective of our work is to hide the boundaries or seams between heterogeneous data repositories or between data repositories and applications that need to access these repositories. Developing appropriate theoretical and practical foundations for transparent data exchange and integration results in software and data that are easier to develop, reuse, share and maintain.

In the remainder of this paper, we give an overview of our research program. In Section 2 we outline some formal models of type compatibility, type safety, and name management. We also describe how these models are used to evaluate and compare various approaches to data exchange and integration. In Section 3 we discuss the practical aspects of our work, specifically the development of a new, highly transparent approach to data exchange and integration, called PolySPIN. A collection of prototype, automated tools supporting the use of the PolySPIN approach, as well as our experience with their application, is also described. The paper concludes with a summary and directions for future work in Section 4.

2. Theoretical Foundations

Our work to date on data exchange and integration has focused on the kinds of data supported and used by modern programming languages. Examples of such data range from traditional files to complex user-defined abstract types and object-oriented classes to persistent objects (e.g., [KMRW96, RTW97]). More specifically, we are concerned with interoperability problems in programming languages, e.g., how an application written in a particular programming language accesses data that was defined and created by an application written in a different programming language. In this section, we give an overview of our work on formal foundations for interoperability.

To improve our understanding and assessment of the suitability of different interoperability mechanisms, we have begun to develop a taxonomy of important facets of interoperability approaches. (See [KW96, BKW96] for details.) One critical aspect of this taxonomy involves the time at which the decision to share data components is made. Specifically, three distinct timing scenarios for interoperability decisions can be characterized by the relationship between the relative times at which the sharing or shared components are developed and the decision to share them is made:

File-based approaches and IDL-based approaches, such as CORBA and OLE/COM, are best suited for the easiest case, while the PolySPIN approach (described in Section 3) is best suited for the common and megaprogramming cases.

Our work also includes the development of formal models of name management. Name management, i.e., how a computing system allows names to be established for objects, permits objects to be accessed using names, and controls the availability and meaning of names at any point in time, is fundamental to almost all aspects of computing. It is of particular importance in the domain of data exchange and integration since how an application assigns and controls the meaning of names has a direct impact on the ease with which other applications can share that data. At present, applications typically rely on ad hoc mechanisms that hinder sharing and integration efforts.

For example, an application may create a collection of named data files. The bindings (i.e., the name-file pairs) may reside in a single directory or be organized in a directory hierarchy. Other applications that require access to these data files must have, therefore, a precise understanding of the way these data files are named and organized in order to use them properly.

We have developed a formal model of name management called Piccolo, with an operational semantics based on evolving algebras [KW95]. The model allows for the precise specification of some formal properties and analyses of name management approaches. Using Piccolo, we can precisely express the name management requirements for applications, thus facilitating the sharing and integration of (named) data.
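The directory example above can be made concrete with a small sketch. The following Python fragment is not Piccolo and does not attempt to reflect its evolving-algebra semantics; it only illustrates the basic ingredients of name management mentioned earlier: establishing bindings and accessing objects through names. The class `Context` and its methods `bind` and `resolve` are hypothetical names introduced purely for illustration.

```python
class Context:
    """A set of name-to-object bindings; a bound object may itself be a Context."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, obj):
        """Establish (or replace) the binding for `name` in this context."""
        self._bindings[name] = obj

    def resolve(self, path):
        """Resolve a '/'-separated name by walking nested contexts."""
        obj = self
        for part in path.split("/"):
            if not isinstance(obj, Context):
                raise KeyError(path)  # tried to descend into a non-context
            obj = obj._bindings[part]  # raises KeyError if the name is unbound
        return obj


# Hypothetical data: one application keeps its maps in a single flat context ...
amherst_map = object()
flat = Context()
flat.bind("map-amherst", amherst_map)

# ... while another organizes the same data in a hierarchy.
regions = Context()
regions.bind("amherst", amherst_map)
tree = Context()
tree.bind("maps", regions)

# A third application must know which convention is in force to find the map.
found_flat = flat.resolve("map-amherst")
found_tree = tree.resolve("maps/amherst")
```

Both lookups reach the same object, but only because the caller knows each application's naming convention. Making such conventions explicit is the kind of requirement a formal name management model is intended to capture.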

We are also developing a suitable formal foundation, based on concepts from signature matching and object-oriented type theory, that can aid in reasoning about, and implementing support for, cross-language type compatibility [BRW98,B98]. Programming language type systems are extremely rich and flexible mechanisms for describing complex data and their relationships. However, once an application has defined and created complex data using a specific programming language's type system, it is extremely difficult to access that data from applications written in different languages. Our formal model allows developers to precisely express compatibility relationships between object-oriented types that are defined in different programming languages.

For example, several C++ and CLOS (Common Lisp Object System) applications may define and create geographical maps using the native type systems of the languages in which they are written (i.e., C++ and CLOS, respectively). Another application (written in either C++ or CLOS) may need to access and manipulate these maps. A formal understanding of the differences and similarities between the C++ and CLOS map data definitions (i.e., type definitions) provides a basis for allowing other applications to access and process both the C++- and CLOS-defined maps.
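One simple way to picture such a comparison is exact signature matching over operation descriptions. The Python sketch below is an illustration under strong assumptions and not the formal model referred to above: it assumes that the operations of each map type have already been translated into a shared, language-neutral vocabulary of type names (the strings `"point"`, `"real"` and so on are invented here), and it treats two operations as compatible only if their names, parameter types and result types are identical. The names `Operation` and `compatible_operations` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """An operation signature expressed in an assumed language-neutral vocabulary."""
    name: str
    params: tuple  # parameter type names, in order
    result: str    # result type name


def compatible_operations(type_a, type_b):
    """Return the operations whose signatures match exactly in both types."""
    return set(type_a) & set(type_b)


# Hypothetical descriptions of a map type as defined in C++ and in CLOS.
cpp_map = {
    Operation("distance", ("point", "point"), "real"),
    Operation("scale", ("real",), "none"),
}
clos_map = {
    Operation("distance", ("point", "point"), "real"),
    Operation("describe", (), "string"),
}

shared = compatible_operations(cpp_map, clos_map)
shared_names = sorted(op.name for op in shared)
```

Here only `distance` is common to both definitions, so only that operation could be offered uniformly to an application that processes maps from either language. A realistic compatibility relation would presumably be more permissive than exact matching, for instance by relating differently named but equivalent types.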

3. Practical Foundations

Concomitant with our work on theoretical foundations for interoperability is our work on practical foundations. Based on our models, we have been designing, developing, and experimenting with tools that facilitate interoperability. Conch - A CONtext Controlling sHell - provides improved name management capabilities for multilanguage object-oriented databases, while PolySPIN and PolySPINner help automate the construction of software that accesses data defined in multiple programming languages. In this paper, we focus on PolySPIN. For details on Conch, see [KW94].

The PolySPIN approach provides a transparent interoperability mechanism for programming languages. More specifically, it provides support for polylingual interoperability [KW96], where applications can access compatible types defined in distinct languages as if they were defined in the language of the application. The fact that the types are defined and implemented in a different programming language is hidden from the application. Related to this mechanism is PolySPINner, a collection of tools that automates PolySPIN and supports type-safe polylingual interoperability [BKW96,B98,K96].

To better understand the concept of polylingual interoperability and its relationship to the PolySPIN mechanism and the PolySPINner toolset, we expand on the geographical map example outlined in Section 2. In this scenario, a data repository is populated with geographical maps, some of which have been defined in C++ and others defined in CLOS. (Data repositories provided by object-oriented databases [WBT92] provide such capabilities.) Later an application, perhaps written in CLOS, needs to access and manipulate these maps. Approaches that support polylingual interoperability allow the application to treat the maps as if they were all implemented in CLOS (i.e., hiding the fact that some of the maps are also implemented in C++). Figure 1 illustrates the concept.

Figure 1: A Polylingual Application

Although approaches such as standard file formats, relational databases, and IDLs (e.g., CORBA, COM) support certain aspects of polylingual interoperability, our approach offers several advantages over such mechanisms:

To help illustrate these advantages, we describe how the PolySPINner toolset can be used to create a CLOS map application that accesses and manipulates maps defined using both CLOS and C++ type systems. Figure 2 gives an overview of the toolset, along with its inputs and outputs. The toolset takes as input the class definitions (i.e., interface and implementation) for the C++ and CLOS types. With respect to our scenario, there are CLOS and C++ type definitions for geographical maps. The types may match exactly, or more likely, there is an intersection of features that is relevant to the new CLOS map application. In any event, PolySPINner first parses the type definitions for each language and then determines whether the types are compatible (where compatibility can be specified as discussed in Section 2). If the types are deemed compatible, PolySPINner automatically re-engineers the implementations of all relevant operations according to the PolySPIN framework. The re-engineered operations include code that checks for the actual language implementation for objects and then invokes the appropriate operation implementation in the appropriate language.

For example, suppose both the CLOS and C++ class (type) definitions for the map objects provide operations for computing Euclidean distance. The re-engineered version of the implementation of the CLOS operation will first check whether the actual object to which it is being applied is implemented in CLOS or C++. If the object is implemented in CLOS, then the CLOS code is executed just as it would have been prior to PolySPINner's re-engineering. If the object, however, is implemented in C++, then the parameters to the CLOS operation are mapped to corresponding C++ representations and the C++ operation (corresponding to the invoked CLOS operation) is invoked. It is important to understand that the code supporting the interoperability logic is maintained by the PolySPIN mechanism (e.g., language implementation information) and generated by the PolySPINner toolset. Equally important, this interlanguage code is transparent to the application. Thus, a CLOS application views and accesses all data as CLOS data. Furthermore, operation signatures remain unchanged and modifications are made only to the implementations of the relevant operations. With respect to software engineering concerns, this means that only the re-engineered operations need to be recompiled and therefore there is no impact on existing applications.
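The shape of such a re-engineered operation can be sketched in a few lines. The Python fragment below is only an analogy and is not output of PolySPINner: Python stands in for the application's language (CLOS in the scenario), an ordinary function `foreign_distance` stands in for the C++ operation, and the attribute `impl_language` is a hypothetical stand-in for the language implementation information maintained by the mechanism. All names are assumptions made for illustration.

```python
import math


def foreign_distance(x1, y1, x2, y2):
    """Stand-in for the operation implemented in the other language."""
    return math.hypot(x2 - x1, y2 - y1)


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Map:
    """Application-facing map type with a re-engineered `distance` operation."""

    def __init__(self, impl_language="native"):
        # Records which language actually implements this object.
        self.impl_language = impl_language

    def distance(self, p, q):
        # The signature is unchanged; only the body has been modified.
        if self.impl_language == "foreign":
            # Map the parameters to the foreign representation (plain floats
            # here) and invoke the corresponding foreign operation.
            return foreign_distance(float(p.x), float(p.y), float(q.x), float(q.y))
        # Otherwise run the original native implementation unchanged.
        return math.sqrt((q.x - p.x) ** 2 + (q.y - p.y) ** 2)


# The application calls `distance` the same way for both kinds of map.
maps = [Map("native"), Map("foreign")]
results = [m.distance(Point(0, 0), Point(3, 4)) for m in maps]
```

The calling code never inspects `impl_language`; the check lives entirely inside the operation body, which is why callers written against the original signature are unaffected.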

Figure 2: Conceptual Overview of PolySPINner

4. Summary

In this paper, we have given an overview of our activities toward establishing computer science foundations for transparent data exchange and integration. We have briefly described our work on models that help describe and assess approaches to type compatibility, name management and type safety in polylingual software systems. We have introduced PolySPIN and PolySPINner, an approach and toolset that support and facilitate the construction and maintenance of polylingual software systems. Although much of our work has been set in the context of programming languages, we believe that our results are applicable to the domain of scientific and technical data exchange and integration. Programming languages are powerful tools for building scientific applications. The results from our formal and practical activities, therefore, should help ease the development of this particular class of applications.

We are pursuing several directions in our current and future work. In the area of formal models, we plan to extend and improve our models so that they can account for other crucial aspects of transparent data exchange and integration, such as versions, evolution and dynamic system modification. We also plan to extend the PolySPIN mechanism and PolySPINner toolset. We would like to support programming languages in addition to CLOS and C++, such as Java and Ada95, as well as improved interlanguage type safety mechanisms. Finally, we envision evaluating and experimenting with our approach in domains that have large-scale demands, such as scientific applications.

References

[BKW96] Barrett, D.J., Kaplan, A. and Wileden, J.C.,
Automated Support for Seamless Interoperability in Polylingual Software Systems,
Fourth Symposium on the Foundations of Software Engineering, San Francisco, CA, October 1996.

[BRW98] Barrett, D.J., Ridgway, J.V.E. and Wileden, J.C.,
Polylingual Object Usage Made Easy ... and Safe (submitted).

[B98] Barrett, D.J.,
Polylingual Systems: An Approach to Seamless Interoperability, PhD Thesis, Department of Computer Science, University of Massachusetts, Amherst, MA, February 1998.

[K96] Kaplan, A.,
Name Management in Convergent Computing Systems: Models, Mechanisms and Applications, PhD Thesis, Technical Report TR--96--60, Department of Computer Science, University of Massachusetts, Amherst, MA, May 1996.

[KMRW96] Kaplan, A., Myrestrand, G.A., Ridgway, J.V.E. and Wileden, J.C.,
Our SPIN on Persistent Java: The JavaSPIN Approach,
Proceedings First International Workshop on Persistence and Java, Drymen, Scotland, September 1996.

[KRW97] Kaplan, A., Ridgway, J.V.E. and Wileden, J.C.,
Why IDLs Are Not Ideal, Ninth IEEE International Workshop on Software Specification and Design, Ise-Shima, Japan, April 1998.

[KW96] Kaplan, A. and Wileden, J.C.,
Toward Painless Polylingual Persistence,
Proceedings Seventh International Workshop on Persistent Object Systems,
Cape May, NJ, May 1996.

[KW95] Kaplan, A. and Wileden, J.C.,
Formalization and Application of a Unifying Model for Name Management,
Third Symposium on the Foundations of Software Engineering,
Washington, D.C., October 1995.

[KW94] Kaplan, A. and Wileden, J.C.,
Conch: Experimenting with Enhanced Name Management for Persistent Object Systems,
Proceedings Sixth International Workshop on Persistent Object Systems,
Tarascon, France, September 1994.

[RTW97] Ridgway, J.V.E., Thrall, C. and Wileden, J.C.,
Toward Assessing Approaches to Persistence for Java,
Proceedings Second International Workshop on Persistence and Java, Half Moon Bay, CA, August 1997.

[WWRT91] Wileden, J.C., Wolf, A.L., Rosenblatt, W.R. and Tarr, P.L.,
Specification Level Interoperability, Communications of the ACM, 34:5, May 1991, pp. 72--87.

[WBT92] Wells, D.L., Blakely, J.A. and Thompson, C.W.,
Architecture of an Open Object-Oriented Database Management System, IEEE Computer, 25(10), October 1992, pp. 74--82.