Querying Astronomical Catalogues with the Virtual Observatory

Clive Davenhall (acd@roe.ac.uk)
Version 1, 25 March 2002

Introduction,
Types of Queries,
Format of the Returned Results,
Use Case,
Discussion,
Example Queries,

Introduction

This note discusses the types of queries which the Virtual Observatory (VO) should be able to make on an astronomical catalogue. I'll use the term `catalogue' throughout for convenience, but it should be understood to include archives and surveys of tabular data. For the bulk of the document I'll describe the queries in an abstract way, essentially as a text string, and say nothing about the transport protocols and mechanisms that might be used to transmit them to a remote server hosting a copy of the catalogue. However, it is certainly possible to invent an XML representation for the queries and wrap them in a SOAP envelope. The final section gives an example of such an XML representation.

The note is intended as a basis for discussion. It represents only my own opinion and alternative views are certainly possible. I deliberately describe the facilities which I think are desirable, without worrying unduly about how difficult they might be to implemented. Conversely, there are probably things that I have forgotten. The facilities are described in sufficient detail to illustrate the discussion, but I don't give formal and complete specifications.

Before describing how an astronomical catalogue might be queried it is necessary to have a model of what constitutes such a catalogue. Here I'll use a model similar to that adopted explicitly by CURSA and implicitly by the VOTable. A catalogue comprises the following elements.

Description: A human-readable description of the catalogue.
Columns: A set of definitions of all the columns in a catalogue. Each column has a name, data type, units, description and several similar quantities.
Parameters: A set of definitions of items which pertain to the entire catalogue. Each parameter has a name, value, data type, units, description and several similar quantities.
Table: The table of values which comprise the bulk of the catalogue.
Indices: Any indices existing on columns in the catalogue.

In the context of the VO it is necessary that queries should be able to operate on catalogue parameters as well as columns. One reason is that in the VO queries might be trying to identify catalogues that are suitable for a given purpose, as well as selecting rows from a known catalogue.

I hope that the types of query described in the rest of the note cover the requirements for querying catalogues in sidereal and solar astronomy (solar catalogues are usually the equivalent of `observing logs' in sidereal astronomy). However, I don't think that they cover all the requirements in Solar-Terrestrial Physics, where, in some cases, the queries are required to extend beyond querying the observing log to extracting values from the bulk data files which the observing logs tabulate. I'm rather flumoxed by how to address this requirement.

Types of Queries

Assume that the VO client/portal has identified a service, such as the CDS or LEDAS, hosting a collection of astronomical catalogues. It now needs to make inquiries of the service to discover which catalogues might be relevant to solving the problem to hand and then to submit detailed queries to the chosen catalogues. The client/portal may be able to obtain some summary information about the service (and perhaps individual catalogues) from whatever registry it used to find the service, but at least some of the details will probably have to be obtained by querying the service itself. The types of inquiry which the client/portal needs to be able to make of the service include the following.

List of catalogues: Return a list of all the catalogues hosted by the service. The information provided for each catalogue should include the identifier by which the service knows it and probably a short description of the catalogue which can be presented to the user.
Summary details of an individual catalogue: Return details of an individual catalogue. Typically this information is used to allow the client/portal (possibly steered by the user) to decide whether the catalogue is relevant to the problem to hand. The information provided should include: wavelength range, sky coverage, limiting magnitude/sensitivity, spatial resolution etc. We need to agree a standard set. One (but not the only) way to store this information is a parameters added to the individual catalogues.
Columns in an individual catalogue: Return a list of the names (as used in queries etc.) of all the columns in an individual catalogue.
Details of a given column in an individual catalogue: Return details of a given, named column in an individual catalogue. The details for each column include: data type, units, CDS UCD, description etc.
Indexed columns an individual catalogue: Return a list of all the columns in an individual catalogue which are indexed or sorted (and hence on which fast selections can be presumed to be available).
Two-dimensional indices in an individual catalogue: Return a list of any two-dimensional indices which exist on pairs of columns in an individual catalogue (again to establish what fast selections are available).
Number of rows in an individual catalogue: Return the number of rows in an individual catalogue.
Types of query supported: Return a list of the types of catalogue query (see below) which the service supports. For example, a basic service might only support `RANGE, CONE' (using the nomenclature introduced below). A more capable service might return `RANGE, CONE, ARB, FREQ, LITBIG, POLY'. I'm assuming that a given service will support the same types of query on all its catalogues. If not, then the inquiry would have to be made on the basis of individual catalogues.

Most of the above types of inquiry are used during a preamble in which the client/portal engages in a dialogue with the service to identify one or more catalogues that are relevant to the problem to hand and to set itself up ready to query them. Two additional mechanisms are needed to submit a query to select rows from a catalogue according to some criteria:

submit the query and return an estimate (probably spelt g-u-e-s-s) of how long the query will take to complete.
submit the query and return a table of rows which satisfy the criteria.

A query comprises (in an abstract sense):

a string defining the type of the query,
a string defining the details of the query,
a list of columns in the catalogue which are to be returned.

Turning to the last item first, the list of columns will typically be a subset of the columns in the catalogue. There should be some easy mechanism to indicate that all the columns in the catalogue are required, maybe `*'. It is an open question whether the list should allow new columns to be projected. For example, if a catalogue contained columns A, B and C then should lists of the form `A, B, D = 2.0*A*log(B/C)' be allowed?

The query may be of one of several different types. Both the type and the details of the individual query need to be specified. Possible query types include the following.

RANGE: column-name > minimum-value and/or column-name < maximum-value: Select the values that lie within a particular range for a given column. This type of query is approximately equivalent to the CDS ASU.
CONE: RA-column Dec-column central-RA central-Dec radius-in-arcmin: A `cone search' to find the objects within a given angular radius of a given central celestial coordinate.
ARB: arbitrary-algebraic-expression: An arbitrary expression constructed from column and parameter names, arithmetic operators, brackets, mathematical, astronomical and statistical functions and relational operators. The astronomical functions required include great circle distance (gcd) and position angle (pa). Note that columns and parameters can both occur in expressions, which implies that they occupy the same name-space. As an example consider a catalogue where A, B and C are columns and P and Q are parameters, then an expression might be: A + (P*sin(B) > 3.0 + log(C/Q)
FREQ: n: Select every nth row from the catalogue.
LITBIG: LITTLE or BIG column-name n: Select the n largest or smallest entries in the catalogue, according to the order in column column-name.
POLY: INSIDE or OUTSIDE x-column y-column n list-of-xy-coords-of-corners: Select points inside or outside a polygon drawn in the plane defined by columns x-column y-column. The polygon has n corners and a list giving the coordinates of these corners is supplied. (The algorithm for performing this selection involves, for each row in the catalogue, drawing a line from the point corresponding to the row to a point outside the polygon and checking how many times it crosses the polygon boundary. If the number is even the row is outside.)

RANGE and CONE are special cases of ARB. The advantage of providing them separately is that because they are simpler the service can arrange to use any indices that are available on the columns involved. There are substantial difficulties involved in using indices in ARB expressions.

An additional complication (as if any were needed!) would be to allow projected columns to appear in ARB expressions using an SQL-like WHERE clause. For example, again suppose that a catalogue has columns A, B and C and parameters P and Q, then:

{A + (P*sin(B) > 3.0 + log(D/Q)} WHERE {D = 2.0*log(B/C)}

Format of the Returned Results

The selected rows are returned from the service to the client/portal. The VOTable format was invented to represent such selections and is entirely suitable for this purpose. In addition to the table of selected rows (or a pointer to a separate binary representation thereof) the VOTable should contain:

the complete description of the original catalogue,
all the parameters of the original catalogue,
full definitions of the columns (FIELDS in VOTable notation) requested in the original query.

Also, it is often useful if the returned table can be annotated with a description of the original query. These details can form a useful aide-memoire during subsequent use of the results. (The more recent CURSA applications preserve such information, once I'd realised that it was a good idea). The obvious way to represent these details is as a set of parameters, all of type varstring. The following parameters should be adequate.

QUERYTYPE: The type of query. One of: RANGE, CONE, ARB, FREQ, LITBIG, POLY.
QUERYEXPR: The query expression or details.
QUERYFIELDS: A list of the columns returned. This string is a copy of the list of columns submitted in the query. This parameter may be un-necessary as the VOTable necessarily contains details of the columns that it contains.
QUERYCOMMENTS: A free-text description supplied by the user to annotate the query. (If this information is required then the query submission mechanisms would have to be modified to incorporate it.)
QUERYTIME: The date and time of the query in ISO format. The time recorded is the local civil time on the server that executed the query.

Use Case

The FREQ, LITBIG and POLY type selections are relatively unusual and perhaps I should provide the outline of a use-case, illustrating how they might be used. Suppose that the user is looking for outlying objects in a large catalogue of, say, 10**8 objects. His first step is to extract every 10,000th row to get a representative subset (which will still comprise some 10,000 objects). This subset is returned and plotted as a scatterplot using some pair of columns which are relevant to the problem to hand (to fix ideas the scatterplot might be a colour - magnitude diagram). The user then interactively draws a polygon around the locus where most of the objects lie. He submits the corners of this polygon as a POLY selection and the objects outside the polygon are selected as outliers. The user knows that he will never get enough observing time to look at more than a handful of the brightest of these objects, so he submits a LITBIG query to return the hundred brightest of the outliers.

Of course, drawing a polygon on a scatterplot is an entirely subjective way to make a selection (which is not to say that astronomers won't want to do it). However, a more objective alternative might be to grid the points in the scatterplot into a two-dimensional array and generate a set of contours from this array (the contour levels show the number of objects in each element of the array). The user would then choose one of the contours and the points which define it would be extracted and submitted as a POLY selection.

Discussion

The types of queries described in this note are deliberately quite ambitious, and are matched to the astronomical requirements rather than the functionality usually provided by a DBMS. They are not, however, impossible to provide; similar facilities to most of them are available for querying small-to-medium sized catalogues with catselect in CURSA. Conversely, the queries only address the problem of searching a single catalogue. The possibility of searching joined or paired catalogues may also need to be considered.

The note has talked about a VO client/portal sending a query to a remote `service'. In practice such a service will probably need a VO `front end' or `top layer' to accept the query and translate it into whatever format is required by the service proper. The underlying service will often be a legacy system rather than something constructed specifically for the VO. Note that allowing catalogue parameters in the arbitrary expressions defining queries precludes any naïve translation of the query into SQL.

Finally, the note has assumed that the only result of a query is that the selected set of objects are returned to the client/portal as a VOTable. Another possibility, which may be desirable, is to save the selection on the remote service, where it would be accessible for future use. It could be saved as either a separate catalogue or an `index' into the selected rows in the base catalogue.

Example Queries

This section illustrates how the queries could be represented using an XML element called VOCATQUERY. The name VOCATQUERY is chosen because the query is specific to a single catalogue hosted by a service. VO queries might simultaneously query several catalogues hosted by a given service and there could be a VOQUERY element comprising several VOCATQUERY elements.

<VOCATQUERY qtype= RANGE | CONE | ARB | FREQ | LITBIG | POLY>
  <EXPR>
   Expression or details of the query.
  </EXPR>
  <FIELDS>
   List of columns to be returned in the VOTable.
  </FIELDS>
  <DESCRIPTION>
   Description of the query supplied by the user to annotate the query.
  </DESCRIPTION>
</VOCATQUERY>

Examples

<VOCATQUERY qtype=RANGE>
  <EXPR>
   B < 18.0
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example RANGE query.
  </DESCRIPTION>
</VOCATQUERY>

<VOCATQUERY qtype=CONE>
  <EXPR>
   RA DEC  12:30:00  -30:12:00  20
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example CONE query; objects within 20 arcmin of 12:30:00 -30:12:00.
  </DESCRIPTION>
</VOCATQUERY>

<VOCATQUERY qtype=ARB>
  <EXPR>
   (B < 18.0) & (B_V > 0.3)
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example of arbitrary expression.
  </DESCRIPTION>
</VOCATQUERY>

<VOCATQUERY qtype=FREQ>
  <EXPR>
   100
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example FREQ query; select every hundredth object.
  </DESCRIPTION>
</VOCATQUERY>

<VOCATQUERY qtype=LITBIG>
  <EXPR>
   LITTLE B 100
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example LITBIG query; select the 100 brightest objects in the B band.
  </DESCRIPTION>
</VOCATQUERY>

<VOCATQUERY qtype=POLY>
  <EXPR>
   OUTSIDE B_V U_B 30 list-of-coords-of-polygon-corners
  </EXPR>
  <FIELDS>
   RA, DEC, B, B_V, U_B
  </FIELDS>
  <DESCRIPTION>
   Example POLY query.
  </DESCRIPTION>
</VOCATQUERY>