R.D. Jones, S. Heinicke
Hughes STX Corporation
D.A. Becker
Massachusetts Institute of Technology Media Laboratory
In order for us to keep track of all of this data across multiple programs and to understand it better in our role as the primary data archive center, we have constructed a multi-layered system of cataloging metadata. As each new data item is received for archive, temporal, spatial, and spectral range information is automatically extracted and populated to a "summary" catalog database; more detailed information is also extracted and populated to a "program" catalog. Users can locate and obtain data using the Summary Catalog interface. Users can then use the detailed program catalogs to assist in the analysis of the data they have received.
We have constructed an easy-to-use interface which is based on Jason Ng's (NCSA) GSQL scripts. The GSQL scripts have been modified to permit access to the INGRES RDBMS (so far as we know, we are the first group to make this link) and the use of on-the-fly form and graphical image construction.
Projecting our experience onto the global problem of locating data, we suggest the implementation of an international distributed database system. This SQL-based distributed server system would allow users to construct queries using standard HTML form sets and locate data at any of the participating servers around the world.
Most of the expected new data are from the Midcourse Space Experiment (MSX) which is scheduled for launch this year. MSX is a multi-sensor platform which will obtain spectral, radiometric and image data of the Earth and the celestial backgrounds in the UV, Visible and IR regions of the spectrum. The BDC has expended great effort, in concert with the MSX Data Management organization, to ensure that the archived data will be easily accessible to users.
We have found that a primary difficulty in being an archive center is in locating data which satisfies a user's needs. For example, if an investigator is interested in spectral data in the extreme UV of the Eta Carinae region, how do we process this request? What if the investigator is interested in just one of the supported programs? What if he is interested in all of the supported programs?
The efficacy of the catalog (and thus the system as a whole) depends on how much and what sort of information is contained within it. The structure of the catalog dictates how much information should go into it, while the data management system (i.e. the method for receiving and processing the data) determines how much of that information actually does.
It is useful to think in terms of the granularity of the catalog. By this we mean the level of detail which is contained for each discrete entry. A catalog with small granularity contains relatively many discrete entries each with a high level of detail associated with it, while a catalog with large granularity contains relatively fewer entries with less detail about each entry. We maintain two types of catalogs: the program catalog (small granularity) and the summary catalog (large granularity). Information about each program's data is recorded in detail in a specific program catalog. Information about data for each program at a larger granularity is recorded in the summary catalog.
We call this information about the data metadata. By metadata we mean information which describes the setting of the actual data product. For example, the IR spectra of the central region of the Small Magellanic Cloud are data, but information about when the data were taken, what filter was used in the instrument, what the RA and DEC were, etc., is what we call metadata. It is by means of the metadata that we locate data for the investigator; it is up to him to use the data once located. It is therefore the metadata which fills the catalogs.
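To make this concrete, the sketch below shows the sort of metadata fields a program catalog row might carry for a single data product. The table and column names are our own illustrative choices rather than the actual BDC schema, and the data types are only representative:

    CREATE TABLE program_catalog (
        product_id   CHAR(16) NOT NULL,  -- archive identifier of the data product
        obs_start    DATE,               -- when the observation began
        obs_end      DATE,               -- when the observation ended
        instrument   CHAR(12),           -- instrument which took the data
        filter       CHAR(8),            -- filter in place during the observation
        ra_deg       FLOAT,              -- right ascension of the target (degrees)
        dec_deg      FLOAT,              -- declination of the target (degrees)
        wave_min     FLOAT,              -- short end of the spectral coverage (microns)
        wave_max     FLOAT               -- long end of the spectral coverage (microns)
    );

A query against columns such as these is a query against metadata; the spectra or images themselves are stored elsewhere in the archive.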
One further distinction we make that is related to the catalog granularity is
that of dataset range vs. data product. As the number of rows in a catalog increases, so does the query time; with n programs, each containing m_i individual entries, the total number of entries in a comprehensive database is Σ_{i=1}^{n} m_i.
At sufficiently large numbers of entries per program (e.g. hundreds of
thousands) this forces the designers of the database to either
tolerate extremely long search times, or to make the granularity
larger and lose some information. We have addressed this problem by
making the discrete data entry a relatively large granule and then
recording range information rather than specific values. The summary
catalog entries thus are composed of sets of data
products--datasets--and associated ranges. The modus operandi for
locating data is thus to construct a range query and submit it to the
summary catalog. This results either in a negative answer or in a list of datasets which satisfy his query. The investigator can order the
dataset(s) and then use the much more detailed information in the
associated program catalog to select and locate actual data products
such as scenes or spectra.
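As a sketch of this arrangement (again with hypothetical table and column names), a summary catalog row describes an entire dataset by its ranges:

    CREATE TABLE summary_catalog (
        program_acronym  CHAR(12) NOT NULL,  -- supported program the dataset belongs to
        product_id       CHAR(16) NOT NULL,  -- dataset identifier
        dataset_descr    VARCHAR(80),        -- brief description of the dataset
        time_min         DATE,               -- earliest observation time in the dataset
        time_max         DATE,               -- latest observation time
        wave_min         FLOAT,              -- spectral coverage (microns)
        wave_max         FLOAT,
        ra_min           FLOAT,              -- celestial coverage (degrees)
        ra_max           FLOAT,
        dec_min          FLOAT,
        dec_max          FLOAT
    );

With one row per dataset rather than one row per data product, the summary catalog stays small enough to search quickly; the individual scenes and spectra are enumerated only in the corresponding program catalog.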
2.3 The Structure of the Metadata Catalog
The intent of this paper is not to discuss the detailed design of RDBMSs, but
we will make some relevant observations. There are a number of tricks to
optimizing the processing time for queries which any good SQL programmer
should employ. These involve such things as analysis of the typical types of
queries expected, use of indices and multiple tables rather than single
tables with a large number of columns, and so-called 'sanity checks' which return the number of rows a query would match before the full query is actually run. The catalog should be designed with speed and efficiency
foremost rather than ease-of-use; construction of the optimal interface (vide
sections 3 and 4)
will shield the user from the inner workings of the catalog
database.
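As one illustration of these points (using the hypothetical summary_catalog table sketched above; the exact syntax will vary between RDBMSs), a 'sanity check' is nothing more than a count of the rows a query would match, run before the full query, and a secondary index on frequently queried columns keeps both operations fast:

    -- Sanity check: how many datasets would this spectral-range query return?
    SELECT COUNT(*)
    FROM   summary_catalog
    WHERE  wave_min <= 14.0 AND wave_max >= 8.0;

    -- Secondary index on columns that appear in most range queries.
    CREATE INDEX summary_wave_ix ON summary_catalog (wave_min, wave_max);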
Interfaces have typically presented a multitude of problems. These problems take the form of trade-offs, such as standardization vs. power and ease of use vs. versatility. A number of corollary problems typically arise when these trade-offs are settled by making particular choices. The major problems we have noted are:
Another solution is to remove the interface altogether and permit direct access using SQL. Users must be SQL experts and must have access to database dictionaries, etc., for this to work. This is asking a lot of an investigator who is interested in getting his data, not in learning a database query language.
We initially constructed a form which contained all possible query parameters and provided that to the user. The HTML form appeared to us to be the perfect tool for connecting users to a catalog database: it is standard across all supported hardware configurations, it is inherently easy to use, and it has the potential (implementation is a different matter) for being intuitive in use. What we found, however, was that even at the high level of granularity of our summary catalog, and dealing solely with a limited number of query parameters, our form was cumbersome to use. Using this method to prepare queries against the highly detailed program catalogs was out of the question. Furthermore, a rigid form interface does not address the issue of ease of use vs. versatility: we still had to make compromises between giving the user a lot of choices and not making the task impossible to manage for non-SQL experts.
When a user first enters the summary catalog interface he will see a form that consists of 10 CHECKBOX buttons, 3 RADIO buttons, and 2 buttons for resetting and submitting. Each of the CHECKBOX buttons corresponds to a query element (e.g. time/date range, spectral range) and each of the RADIO buttons corresponds to a type of query form (i.e. standard, advanced, or editable). Once the user has selected the desired query elements and the type of query, the contents of the form are submitted using the POST method to dsql_form.cgi.
Dsql_form.cgi takes the input (passed through STDIN, as per the POST method) from the initial form and builds a form that contains only those elements selected on the initial form. The form generated by dsql_form.cgi comes in two flavors. If the standard option was selected, the form only has INPUT fields for the minimum and maximum of each query element. If either the advanced or editable query type was selected, a form is generated that contains not only INPUT fields for the minimum and maximum of each query element, but also SELECT fields that contain logical operators (AND, OR) for combining the various parts of the queries and range operators (>=, >, =, <>, <, <=) which affect the range of the queries. Both flavors will have a MULTIPLE SELECT field if the Program Acronym element was selected. In addition, at the bottom of each form are five CHECKBOX buttons that allow the user to choose what type of information he wants returned from the query. The user can select information regarding: temporal ranges, spectral ranges, celestial ranges (i.e. RA and DEC), tangent point ranges and pierce point ranges. The contents of this form are submitted with the POST method to one of three shell scripts, depending on the type of query. The shell scripts are necessary to set certain PATH information for use by INGRES.
Once the PATH information is set, the scripts call their respective programs, which parse the data, build an SQL query and submit it to INGRES for processing. Each of the three programs is slightly different.
The Standard Program takes the inputs and builds a standard query. This standard query is an expert-level SQL query which does two things for users. First, the query returns the maximum number of data sets which meet the given parameters. This is achieved by building the query so that the data set's minimum value for a given parameter (e.g. the data set's minimum RA) has to be less than the maximum value for the parameter provided by the user, and the data set's maximum value for a given parameter (e.g. the data set's maximum RA) has to be greater than the minimum value for the parameter provided by the user. This ensures the database returns all data sets which have values in the ranges provided by the user. Second, the query takes into account the circular nature of several of the parameters (RA, longitude). These parameters are particularly complicated to work with for two reasons:
The Advanced Program and the Editable Program take the inputs and build the query as defined by the user (remember that with the advanced and editable queries, the user selects both logical and range operators). Once the queries are built they are passed to a backend written in C and embedded SQL. Before the Editable Program submits the query to INGRES, it first displays the query in a user-editable TEXT field. If the user wishes, he can modify the query before submitting it to the backend. The backend submits the query to the INGRES database, which processes it and returns both the standard results (Program Acronym, Product ID, and Dataset Description), and any additional information (Temporal Ranges, Spectral Ranges, Celestial Ranges, Tangent Point Ranges, and Pierce Point Ranges) the user requested.
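For concreteness, here is a sketch of the kind of standard query described above, again using the hypothetical summary_catalog table with arbitrary user-supplied values. The final clause shows one way to handle the circular nature of RA when the requested range wraps through 0 degrees: split the user's range into two sub-ranges and test each for overlap (this sketch assumes the dataset's own RA range does not itself wrap):

    SELECT program_acronym, product_id, dataset_descr
    FROM   summary_catalog
    WHERE  dec_min  <= -55.0 AND dec_max  >= -65.0    -- DEC overlap with -65 to -55 deg
    AND    wave_min <=  14.0 AND wave_max >=   8.0    -- spectral overlap with 8-14 microns
    AND  (    (ra_min <= 360.0 AND ra_max >= 350.0)   -- user asked for RA 350-10 deg,
           OR (ra_min <=  10.0 AND ra_max >=   0.0)   -- which wraps through 0
         );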
The combination of this method of constructing interfaces with intelligent design and correct population of the catalog databases makes the metadata query an easy-to-use and powerful tool for locating archived data.
The importance of this to the SQL interface is clear: an additional, and even easier to use, interface can now be incorporated into the entire interface package. This must be done prudently, as not all information is necessarily displayed best graphically. Spectral ranges, for example, do not gain anything by being displayed as a rainbow, and may in fact lose some precision if actual numbers are not entered into a field. Furthermore, this interface must become part of the described methodology, i.e. an aspect of the entire interface and not an outright replacement.
Locating data outside of an archive center--on the Internet--is a much larger problem. While there are a number of useful tools, such as meta-indices, for finding data, it is generally a matter of luck as to whether one can find the precise data of interest. Consider this situation: a user is looking for short-wave infrared data of southern California with a spatial resolution of 100-1000 km. How can he find this?
Typically, a user now will launch out over the WWW, poke around in likely places, and try to find some information here and there which may point him in the right direction. Consider how useful the following tool would be, however: a user would connect to a standard host (much like an archie host) and fill out a dynamic image- and forms-based HTML query. He would be able to select a discipline, and also be able to select the query parameters applicable to that discipline. He would be presented with a query form (or imagemap) based on his preliminary selections. He would then fill out the query, and the query server would distribute the query to a world-wide web of database servers. Each of these would then process the query and return the answer to the client. The user would end up with a list of data products which satisfied his query, along with links to the products or information on ordering them.
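To tie this scenario back to the metadata-range approach described earlier, the query each participating server receives could be just another range query against its own catalog. The table and column names below are purely hypothetical, standing in for whatever such a remote-sensing archive might actually maintain:

    -- Hypothetical query fanned out to each participating server: short-wave IR
    -- data of southern California at a spatial resolution of 100-1000 km.
    SELECT product_id, dataset_descr
    FROM   remote_sensing_catalog
    WHERE  lat_min  <=   36.0 AND lat_max  >=   32.0   -- latitude overlap
    AND    lon_min  <= -114.0 AND lon_max  >= -121.0   -- longitude overlap
    AND    wave_min <=    3.0 AND wave_max >=    1.4   -- short-wave IR (microns)
    AND    resolution_km BETWEEN 100.0 AND 1000.0;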
Such a system would be a powerful tool for locating and obtaining science data on the Internet. By making the functionality dynamic, it could incorporate all sorts of science disciplines and not be restricted simply to astronomy or geophysics. Query interface tools would be modular and could be added or modified without affecting the rest of the (parallel) tools.
Our opinion is that the easiest way to construct such a system would not be to enforce standardization, but rather to build specialized interface tools which would handle the interaction between an environment of distributed, heterogeneous database servers and the query clients. Standardization would in all likelihood not work, and the system would likely fall quickly into disrepair. If we allow each data archive center to maintain its own functional database and spend our effort in creating an interface between the database and the query system, we feel that an operational system would be feasible. An example of a distributed, heterogeneous system of databases was described by Ford and Stern at the Second International World Wide Web conference in 1993. They describe a much smaller system composed of at most 20 databases, but the principle can easily be extended to an international system of hundreds of databases. We feel that such a system would be a large step in managing the oddly dual nature of information on the Internet now--an overabundance of information and a paucity of information management.