Querying Astronomical Catalogues with the Virtual Observatory
Clive Davenhall (acd@roe.ac.uk
)
Version 1, 25 March 2002
This note discusses the types of queries which the Virtual Observatory (VO)
should be able to make on an astronomical catalogue. I'll use the term
`catalogue' throughout for convenience, but it should be understood to
include archives and surveys of tabular data. For the bulk of the document
I'll describe the queries in an abstract way, essentially as a text string,
and say nothing about the transport protocols and mechanisms that might be
used to transmit them to a remote server hosting a copy of the catalogue.
However, it is certainly possible to invent an XML representation for the
queries and wrap them in a SOAP envelope. The final section gives an
example of such an XML representation.
The note is intended as a basis for discussion. It represents only my own
opinion and alternative views are certainly possible. I deliberately
describe the facilities which I think are desirable, without worrying unduly
about how difficult they might be to implemented. Conversely, there are
probably things that I have forgotten. The facilities are described in
sufficient detail to illustrate the discussion, but I don't give formal and
complete specifications.
Before describing how an astronomical catalogue might be queried it is
necessary to have a model of what constitutes such a catalogue. Here I'll
use a model similar to that adopted explicitly by
CURSA and implicitly by the VOTable. A catalogue comprises the
following elements.
- Description
- A human-readable description of the catalogue.
- Columns
- A set of definitions of all the columns in a catalogue. Each column
has a name, data type, units, description and several similar
quantities.
- Parameters
- A set of definitions of items which pertain to the entire catalogue.
Each parameter has a name, value, data type, units, description and
several similar quantities.
- Table
- The table of values which comprise the bulk of the catalogue.
- Indices
- Any indices existing on columns in the catalogue.
In the context of the VO it is necessary that queries should be able to
operate on catalogue parameters as well as columns. One reason is that in
the VO queries might be trying to identify catalogues that are suitable for
a given purpose, as well as selecting rows from a known catalogue.
I hope that the types of query described in the rest of the note cover the
requirements for querying catalogues in sidereal and solar astronomy (solar
catalogues are usually the equivalent of `observing logs' in sidereal
astronomy). However, I don't think that they cover all the requirements in
Solar-Terrestrial Physics, where, in some cases, the queries are required to
extend beyond querying the observing log to extracting values from the bulk
data files which the observing logs tabulate. I'm rather flumoxed by how to
address this requirement.
Assume that the VO client/portal has identified a service, such as the CDS or LEDAS, hosting a collection of
astronomical catalogues. It now needs to make inquiries of the service to
discover which catalogues might be relevant to solving the problem to hand
and then to submit detailed queries to the chosen catalogues. The
client/portal may be able to obtain some summary information about the
service (and perhaps individual catalogues) from whatever registry it used
to find the service, but at least some of the details will probably have to
be obtained by querying the service itself. The types of inquiry which the
client/portal needs to be able to make of the service include the following.
- List of catalogues
- Return a list of all the catalogues hosted by the service. The
information provided for each catalogue should include the identifier
by which the service knows it and probably a short description of the
catalogue which can be presented to the user.
- Summary details of an individual catalogue
- Return details of an individual catalogue. Typically this
information is used to allow the client/portal (possibly steered by the
user) to decide whether the catalogue is relevant to the problem to
hand. The information provided should include: wavelength range, sky
coverage, limiting magnitude/sensitivity, spatial resolution etc. We
need to agree a standard set. One (but not the only) way to store this
information is a parameters added to the individual catalogues.
- Columns in an individual catalogue
- Return a list of the names (as used in queries etc.) of all the
columns in an individual catalogue.
- Details of a given column in an individual catalogue
- Return details of a given, named column in an individual catalogue.
The details for each column include: data type, units, CDS UCD,
description etc.
- Indexed columns an individual catalogue
- Return a list of all the columns in an individual catalogue which
are indexed or sorted (and hence on which fast selections can be
presumed to be available).
- Two-dimensional indices in an individual catalogue
- Return a list of any two-dimensional indices which exist on pairs of
columns in an individual catalogue (again to establish what fast
selections are available).
- Number of rows in an individual catalogue
- Return the number of rows in an individual catalogue.
- Types of query supported
- Return a list of the types of catalogue query (see below) which
the service supports. For example, a basic service might only support
`RANGE, CONE' (using the nomenclature introduced below). A more
capable service might return `RANGE, CONE, ARB, FREQ, LITBIG, POLY'.
I'm assuming that a given service will support the same types of query
on all its catalogues. If not, then the inquiry would have to be made
on the basis of individual catalogues.
Most of the above types of inquiry are used during a preamble in which the
client/portal engages in a dialogue with the service to identify one or more
catalogues that are relevant to the problem to hand and to set itself up
ready to query them. Two additional mechanisms are needed to submit a query
to select rows from a catalogue according to some criteria:
- submit the query and return an estimate (probably spelt g-u-e-s-s)
of how long the query will take to complete.
- submit the query and return a table of rows which satisfy the
criteria.
A query comprises (in an abstract sense):
- a string defining the type of the query,
- a string defining the details of the query,
- a list of columns in the catalogue which are to be returned.
Turning to the last item first, the list of columns will typically be a
subset of the columns in the catalogue. There should be some easy mechanism
to indicate that all the columns in the catalogue are required, maybe `*'.
It is an open question whether the list should allow new columns to be
projected. For example, if a catalogue contained columns A, B and C then
should lists of the form `A, B, D = 2.0*A*log(B/C)' be allowed?
The query may be of one of several different types. Both the type and the
details of the individual query need to be specified. Possible query types
include the following.
- RANGE: column-name > minimum-value and/or
column-name < maximum-value
- Select the values that lie within a particular range for a given
column. This type of query is approximately equivalent to the CDS ASU.
- CONE: RA-column Dec-column central-RA central-Dec
radius-in-arcmin
- A `cone search' to find the objects within a given angular radius
of a given central celestial coordinate.
- ARB: arbitrary-algebraic-expression
- An arbitrary expression constructed from column and parameter names,
arithmetic operators, brackets, mathematical, astronomical and
statistical functions and relational operators. The astronomical
functions required include great circle distance (
gcd
)
and position angle (pa
). Note that columns and parameters
can both occur in expressions, which implies that they occupy the same
name-space. As an example consider a catalogue where A, B and C are
columns and P and Q are parameters, then an expression might be:
A + (P*sin(B) > 3.0 + log(C/Q)
- FREQ: n
- Select every nth row from the catalogue.
- LITBIG: LITTLE or BIG column-name n
- Select the n largest or smallest entries in the catalogue,
according to the order in column column-name.
- POLY: INSIDE or OUTSIDE x-column y-column n
list-of-xy-coords-of-corners
- Select points inside or outside a polygon drawn in the plane defined
by columns x-column y-column. The polygon has n
corners and a list giving the coordinates of these corners is supplied.
(The algorithm for performing this selection involves, for each row in
the catalogue, drawing a line from the point corresponding to the row to
a point outside the polygon and checking how many times it crosses the
polygon boundary. If the number is even the row is outside.)
RANGE and CONE are special cases of ARB. The advantage of providing them
separately is that because they are simpler the service can arrange
to use any indices that are available on the columns involved. There are
substantial difficulties involved in using indices in ARB expressions.
An additional complication (as if any were needed!) would be to allow
projected columns to appear in ARB expressions using an SQL-like WHERE
clause. For example, again suppose that a catalogue has columns A, B and C
and parameters P and Q, then:
{A + (P*sin(B) > 3.0 + log(D/Q)} WHERE {D = 2.0*log(B/C)}
The selected rows are returned from the service to the client/portal. The
VOTable format was invented to represent such selections and is entirely
suitable for this purpose. In addition to the table of selected rows (or a
pointer to a separate binary representation thereof) the VOTable should
contain:
- the complete description of the original catalogue,
- all the parameters of the original catalogue,
- full definitions of the columns (FIELDS in VOTable notation)
requested in the original query.
Also, it is often useful if the returned table can be annotated with a
description of the original query. These details can form a useful
aide-memoire during subsequent use of the results. (The more recent
CURSA applications preserve such information, once I'd realised that it was
a good idea). The obvious way to represent these details is as a set of
parameters, all of type varstring
. The following parameters
should be adequate.
- QUERYTYPE
- The type of query. One of: RANGE, CONE, ARB, FREQ, LITBIG, POLY.
- QUERYEXPR
- The query expression or details.
- QUERYFIELDS
- A list of the columns returned. This string is a copy of the list
of columns submitted in the query. This parameter may be un-necessary
as the VOTable necessarily contains details of the columns that it
contains.
- QUERYCOMMENTS
- A free-text description supplied by the user to annotate the query.
(If this information is required then the query submission mechanisms
would have to be modified to incorporate it.)
- QUERYTIME
- The date and time of the query in ISO format. The time recorded is
the local civil time on the server that executed the query.
The FREQ, LITBIG and POLY type selections are relatively unusual and perhaps
I should provide the outline of a use-case, illustrating how they might be
used. Suppose that the user is looking for outlying objects in a large
catalogue of, say, 10**8 objects. His first step is to extract every
10,000th row to get a representative subset (which will still comprise some
10,000 objects). This subset is returned and plotted as a scatterplot using
some pair of columns which are relevant to the problem to hand (to fix ideas
the scatterplot might be a colour - magnitude diagram). The user then
interactively draws a polygon around the locus where most of the objects
lie. He submits the corners of this polygon as a POLY selection and the
objects outside the polygon are selected as outliers. The user knows that
he will never get enough observing time to look at more than a handful of
the brightest of these objects, so he submits a LITBIG query to return the
hundred brightest of the outliers.
Of course, drawing a polygon on a scatterplot is an entirely subjective way
to make a selection (which is not to say that astronomers won't want to do
it). However, a more objective alternative might be to grid the points in
the scatterplot into a two-dimensional array and generate a set of contours
from this array (the contour levels show the number of objects in each
element of the array). The user would then choose one of the contours and
the points which define it would be extracted and submitted as a POLY
selection.
The types of queries described in this note are deliberately quite
ambitious, and are matched to the astronomical requirements rather than
the functionality usually provided by a DBMS. They are not, however,
impossible to provide; similar facilities to most of them are available
for querying small-to-medium sized catalogues with catselect
in
CURSA. Conversely, the queries only address the problem of searching a
single catalogue. The possibility of searching joined or paired catalogues
may also need to be considered.
The note has talked about a VO client/portal sending a query to a remote
`service'. In practice such a service will probably need a VO `front end'
or `top layer' to accept the query and translate it into whatever format
is required by the service proper. The underlying service will often be a
legacy system rather than something constructed specifically for the VO.
Note that allowing catalogue parameters in the arbitrary expressions
defining queries precludes any naïve translation of the query into SQL.
Finally, the note has assumed that the only result of a query is that the
selected set of objects are returned to the client/portal as a VOTable.
Another possibility, which may be desirable, is to save the selection
on the remote service, where it would be accessible for future use. It
could be saved as either a separate catalogue or an `index' into the
selected rows in the base catalogue.
This section illustrates how the queries could be represented using an XML
element called VOCATQUERY. The name VOCATQUERY is chosen because the query
is specific to a single catalogue hosted by a service. VO queries might
simultaneously query several catalogues hosted by a given service and there
could be a VOQUERY element comprising several VOCATQUERY elements.
<VOCATQUERY qtype= RANGE | CONE | ARB | FREQ | LITBIG | POLY>
<EXPR>
Expression or details of the query.
</EXPR>
<FIELDS>
List of columns to be returned in the VOTable.
</FIELDS>
<DESCRIPTION>
Description of the query supplied by the user to annotate the query.
</DESCRIPTION>
</VOCATQUERY>
Examples
<VOCATQUERY qtype=RANGE>
<EXPR>
B < 18.0
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example RANGE query.
</DESCRIPTION>
</VOCATQUERY>
<VOCATQUERY qtype=CONE>
<EXPR>
RA DEC 12:30:00 -30:12:00 20
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example CONE query; objects within 20 arcmin of 12:30:00 -30:12:00.
</DESCRIPTION>
</VOCATQUERY>
<VOCATQUERY qtype=ARB>
<EXPR>
(B < 18.0) & (B_V > 0.3)
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example of arbitrary expression.
</DESCRIPTION>
</VOCATQUERY>
<VOCATQUERY qtype=FREQ>
<EXPR>
100
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example FREQ query; select every hundredth object.
</DESCRIPTION>
</VOCATQUERY>
<VOCATQUERY qtype=LITBIG>
<EXPR>
LITTLE B 100
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example LITBIG query; select the 100 brightest objects in the B band.
</DESCRIPTION>
</VOCATQUERY>
<VOCATQUERY qtype=POLY>
<EXPR>
OUTSIDE B_V U_B 30 list-of-coords-of-polygon-corners
</EXPR>
<FIELDS>
RA, DEC, B, B_V, U_B
</FIELDS>
<DESCRIPTION>
Example POLY query.
</DESCRIPTION>
</VOCATQUERY>