Comments on VOTable 0.4

Clive Davenhall, David Giaretta, Bob Mann, Clive Page, Guy Rixon

23 Jan 2002

 

Here are some combined comments from Clive Page, Clive Davenhall, Guy Rixon, Bob Mann and David Giaretta.

The comments are divided into three types:

- overall comments/thoughts about the general approach

- technical suggestions/corrections

- comments that are more editorial in nature

General Approach

It was felt that there was a lack of explanation of the motivation for yet another format, and in particular an explanation of the name VOTable: what are its advantages for VO work?

The importance of XML is clear, but it is not clear what the use of XML brings in this case. For example, is the use of XSLT planned? Should we be looking at the use of XML Schema, which would allow greater use of XML tools?

It is also unclear whether VOTable is regarded as an interchange format or an on-line format for use in day-to-day analysis. Nor is it clear whether it is supposed to be adequate for large tables or only for small ones. The combination of pure-XML with non-XML data has advantages for large tables but does cause difficulties.

Is it a design goal that the transformation FITS Table ==> VOTable ==> FITS Table should be guaranteed not to lose information?

One of the additional functionalities noted is the use of the VOTable as a query. Yet it did seem that there may be some omissions from its capabilities in these regards.

The technical comments below expand on several of these points.

Technical Comments

Omissions wrt FITS

There is a concern about the omission from VOTable of metadata which applies to the table as a whole. This kind of parameter is very valuable. For example, provenance information is likely to be very significant. Parameters themselves could have associated units, data type and display format. More generally, FITS can have a variety of additional records not allowed for in VOTable, for example the HISTORY records in a FITS file. This is in itself a serious omission because, for example, things like sky coverage, wavelength range or sensitivity could easily be part of a query, if a VOTable is used in that way. It also prevents reversibility of transformations into and out of VOTable, such as FITS table ==> VOTable ==> FITS table.

How is the mapping of VOTable columns to FITS table columns defined? Is it by name or by position?

Section 2.1 says that the VOTable is completely compatible with the FITS Binary Table. I guess that strictly that is true, but in the wild one often encounters FITS tables which use extensions to the strict Standard, especially the variable length, multi-dimensional array, and substring array conventions, described in Appendix B.1 through B.3 of the Standard. I think it is highly desirable to support all these. The B.3 problem is discussed under String Arrays below. The variable length B.1 facility is used extensively in X-ray astronomy: I think it is covered by the variable-sized array facility in VOTable, but I'm not an expert so would welcome confirmation. I don't think the multi-dimensional array facility of B.2 is covered, though it would not be difficult.

Column Descriptors

It may be useful to

String Arrays

Section 1.1 Example line 13 has datatype="A" arraysize="10"
This notation has perhaps been copied directly from the FITS Binary Table spec, in which the only way of specifying a string of characters is to have an array of them. Although the C programming language has the same limitation, as far as I know all other programming languages used by astronomers, including Perl, Python, Java, C++, and Fortran (at least all versions later than Fortran 66), have a well-defined concept of a character string, and most of these languages can handle an array of character strings.

The need for arrays of strings in FITS tables has been recognised, and Appendix B.3 of the FITS Standard describes what one can only call a fudge to achieve this. I think it would be much better for VOTable to have a concept of a character string, and instead use an attribute such as length="10", which would allow the definition of an array of strings in a manner exactly as for all other data types. It would also permit the translation of any FITS file which uses the "Substring Array" convention (as defined in Appendix B.3); since this notation has been quite extensively used in FITS files written in various observatories, it surely needs to be supported.
I don't see any difficulty in translating the "nA" type in a FITS binary table to a string, rather than to an array of characters.

I am also confused by section 3 para 4 which says that "character strings will be padded with null characters if they are shorter than the specified length". It is not clear whether this specifies what is supposed to happen when a FITS table is converted to XML, or when the XML is parsed. There is an obvious need to handle strings of variable length (such as in SQL VARCHAR fields) but FITS has only an uneasy compromise between C pseudo-strings (fixed maximum length, null-terminated if short), and Fortran ones (fixed length, space padded in all practical situations when lengths do not match). I think the VOTable specification needs to decide whether its strings are truly fixed (as in Fortran) or fully dynamic (as in SQL, Perl, Python, etc.) in length, and formulate its conversions from FITS to XML accordingly.
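The difference between the two padding conventions can be illustrated with a short Python sketch. The field width of 10 and the helper names are hypothetical, chosen only for illustration; nothing here is taken from the VOTable draft or from a FITS library.

```python
# Hypothetical helpers contrasting the two fixed-length string conventions.

def decode_c_style(raw: bytes) -> str:
    """C convention: the value ends at the first NUL byte, if any."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


def decode_fortran_style(raw: bytes) -> str:
    """Fortran convention: the value is space padded to the full width."""
    return raw.rstrip(b" ").decode("ascii")


c_field = b"M31".ljust(10, b"\x00")      # NUL padded to 10 bytes
fortran_field = b"M31".ljust(10, b" ")   # space padded to 10 bytes

c_value = decode_c_style(c_field)
fortran_value = decode_fortran_style(fortran_field)
```

Both fields decode to the same three-character value, but only if the reader knows which convention the writer used. Under the space-padded convention a value with significant trailing spaces cannot be recovered, which is one reason the specification needs to state which rule applies.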

Physical Units

Section 1.1 Example line 15 has unit="degrees"

The IAU has in its style manual "Recommendations concerning Units", a copy of which can be found at http://www.iau.org/IAU/Activities/nomenclature/units.html. This has radians as the standard unit of planar angle. Obviously angles measured in radians are not easy for humans to read, but then neither are XML files. If I understand it, VOTable is an XML format to facilitate data interchange, and is therefore really intended only to be machine readable.

The fact that XML is based on ASCII text is a convenience because that avoids problems with endian-ness which make binary formats less portable. For machine reading, radians seem much more sensible as these are the units in which just about all programming languages do their trigonometry, so this avoids unnecessary conversions to/from degrees. Indeed for maximum convenience of humans, at least the astronomical sub-species, even degrees are not optimum, as sexagesimals are so widely used. The OGIP Memo 93-001, reachable from http://legacy.gsfc.nasa.gov/docs/heasarc/ofwg/ofwg_recomm.html, sets out some standards for units in FITS files which are fairly widely followed: this allows degrees as an alternative, but specifies that it should be written as the string "deg", not "degree" and certainly not "degrees" (all units ought to be singular).

Datatypes

Meaning of datatype

By the way, what exactly is the datatype attribute of a FIELD (table on p3) defining:

If (b) (which is the obvious interpretation) then precisely what do the datatype options (in the table on p3) mean when the table is stored as a TABLEDATA or CSV? (think particularly of options X, B or I, though conceptually the issue arises with any of them).

I'm happy with the proposed options for datatype and with the TABLEDATA and CSV table representations. However, together they do beg the above question, and I think that the slightly non-intuitive answer is (a).

Double

The table in section 2 has both "D" and "F" as markers for the double type. The FITS Standard has only "D"; it is not clear to me why a synonym is required.

Bit

Section 3 para 4 says that the arraysize attribute specifies the number of 8-bit bytes, but the equivalent FITS binary table specifier, "rX", has "r" giving the number of bits. The number of bytes has to be derived as int((r+7)/8). If you specify the number of bytes, you don't know exactly how many bits are in use. I suggest using the FITS notation here; the alternative would be to have a special attribute for this data type.
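The loss of information can be seen in a small Python sketch of the bits-to-bytes relationship, assuming the usual rule that a bit array is rounded up to a whole number of 8-bit bytes. The function name is hypothetical.

```python
def bytes_for_bits(r: int) -> int:
    """Number of 8-bit bytes needed to hold r bits (rounded up)."""
    return (r + 7) // 8


sizes = {r: bytes_for_bits(r) for r in (1, 8, 9, 16, 17)}

# 9 bits and 16 bits both occupy 2 bytes, so a byte count alone
# cannot say how many bits are actually in use.
same_storage = bytes_for_bits(9) == bytes_for_bits(16)
```

The mapping from bits to bytes is many-to-one, so a bit count can always be converted to a byte count but not the reverse.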

Boolean

How are Boolean values denoted in TABLEDATA? Are upper as well as lower case "T" and "F" allowed?

Complex

If a cell contains an array of complex numbers then there are, in principle, several ways in which the values could be ordered. For example, for a 3 element complex array:

real[1],imaginary[1],real[2],imaginary[2],real[3],imaginary[3]
real[1],real[2],real[3],imaginary[1],imaginary[2],imaginary[3]

Technically, there are also 2 additional options in which the imaginary part comes first. The standard should specify which of these orders the values should occur in - the first seems the most likely. This is footling pedantry of the worst sort, but if it is written into the standard then there is no scope for ambiguity.
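A short Python sketch shows why the ordering matters: a reader that assumes the first (interleaved) ordering silently produces wrong values when given data written in the second. The sample values and function name are hypothetical.

```python
values = [complex(1.0, -1.0), complex(2.0, -2.0), complex(3.0, -3.0)]

# First ordering: real[1], imaginary[1], real[2], imaginary[2], ...
interleaved = [part for v in values for part in (v.real, v.imag)]

# Second ordering: all real parts first, then all imaginary parts.
planar = [v.real for v in values] + [v.imag for v in values]


def from_interleaved(flat):
    """Rebuild complex numbers assuming the interleaved ordering."""
    return [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]


round_trip = from_interleaved(interleaved)   # recovers the original values
misread = from_interleaved(planar)           # same length, wrong values
```

Both flattened lists contain the same six numbers, so nothing in the data itself reveals which ordering was used.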

Coordinate systems

It is not clear how COOSYS is tied up to the columns in the table which define the coordinate system; the ID attribute could be used for this.

If the VOTable is used outside Astronomy then additional coordinate systems would be needed. Even between co-operating institutes there may be specialised coordinate systems. Defining "system" as an enumerated attribute may be too restrictive.

On a related topic, catalogues created by detecting objects in CCD images and digitised photographic plates usually contain both the positions measured for the objects in the CCD frame or plate and the celestial coordinates derived from them. It may be useful to store the coefficients used to make these transformations in a standard way.

A standard name for TIME may also be necessary, for example for any time-series work on variable stars, as well as STP and Solar work. CDF uses EPOCH. Rather than this, something like TIMESYS may be better to avoid confusion.

By the way, section 4.3 specifies a date in the form "2002-01-31T12:00:00:00" (though I think the last colon may be a mis-print). The reference is to the FITS Standard, but derived from ISO 8601. I think it would be better to refer to the primary source. Unfortunately it costs money to get ISO 8601 from the International Organization for Standardization, but a useful summary exists here: http://www.cl.cam.ac.uk/~mgk25/iso-time.html
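As a quick check of the suspected mis-print, the following Python sketch parses timestamps with a strict YYYY-MM-DDThh:mm:ss pattern. The pattern is an assumption about what the draft intends; the function name is hypothetical.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # date, literal "T", then hh:mm:ss


def parse_timestamp(text: str) -> datetime:
    """Parse YYYY-MM-DDThh:mm:ss; raise ValueError for anything else."""
    return datetime.strptime(text, ISO_FORMAT)


good = parse_timestamp("2002-01-31T12:00:00")

try:
    parse_timestamp("2002-01-31T12:00:00:00")  # form printed in section 4.3
    draft_form_ok = True
except ValueError:
    draft_form_ok = False
```

The form with the extra ":00" is rejected, which supports reading the last colon as a mis-print (fractional seconds in ISO 8601 follow a decimal point or comma, not a colon).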

Null values

Sections 3.4 and 4.1.1 cover NULLs, in part. It is quite important to get this right, since missing values are common in many astronomical tables. The FITS Standard specifies the use of NaN for null values in floating-point fields, but obviously for integer types there is no equivalent, as all bit-patterns may be valid values, so there is an alternative notation, allowing the grabbing of some unlikely value (such as -99) to represent missing information. This has always seemed like a kludge to me, and gets difficult with 8-bit integers, when it can be hard to give one of just 256 values up for this purpose. I would have thought that XML would have a standard way of expressing nulls, but I haven't been able to find it. Since the XML stream is representing integers by strings of characters, there is no need to reserve a particular integer such as -99; it could just as well be a string such as "NaN". It is not clear to me why the "invalid" attribute is needed, as distinct from merely "null".

Is <CELL></CELL> allowed?

For character strings it is proposed to use the FITS representation, using an ASCII NUL value (zero) as the first character. I can see pragmatic reasons for this, but feel that an out-of-band mechanism would be better, especially as null bytes have a habit of causing problems in data transfers. It also avoids forcing the XML parser to read the contents of each string to see whether it is null or not. If, as I proposed above, the VOTable allows arrays of strings, a better null representation is also needed, since one might want to declare missing just some strings in an array of strings.
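A minimal Python sketch of the idea that integer nulls need not steal a value from the integer range. It assumes that an empty cell is allowed and means null (which is exactly the open question above), uses "NaN" as the null token suggested above, and wraps the cells in a hypothetical ROW element; none of this is taken from the VOTable draft.

```python
import xml.etree.ElementTree as ET

# ROW is a hypothetical wrapper element used only for this illustration.
row = ET.fromstring("<ROW><CELL>12</CELL><CELL></CELL><CELL>NaN</CELL></ROW>")


def parse_int_cell(cell, null_token="NaN"):
    """Return an int, or None when the cell is empty or holds the null token."""
    text = (cell.text or "").strip()
    if text == "" or text == null_token:
        return None
    return int(text)


parsed = [parse_int_cell(cell) for cell in row.findall("CELL")]
```

Because the null is detected before any numeric conversion, every integer value, including -99 and all 256 values of an 8-bit field, remains available as data.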

Sort Order information

Section 2 (p3): The rows in a VOTable are not necessarily unordered, and in the case where they are ordered it would be useful to have a mechanism to indicate this. Note that the ordering of tables is not just, or even primarily, a presentation issue. Rather, knowing that a catalogue is sorted on some column allows a program reading the catalogue to make fast `range' selections on this column (binary chops etc).
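The kind of fast range selection meant here can be sketched in Python with the standard bisect module. The column values and function name are hypothetical; the sketch only assumes that the reader has been told the column is sorted in ascending order.

```python
from bisect import bisect_left, bisect_right

# Hypothetical magnitude column, already sorted in ascending order.
magnitudes = [12.1, 13.4, 13.4, 15.0, 16.2, 17.9, 18.3]


def range_select(sorted_column, low, high):
    """Return slice bounds (start, stop) of rows with low <= value <= high."""
    start = bisect_left(sorted_column, low)    # first row with value >= low
    stop = bisect_right(sorted_column, high)   # one past last row with value <= high
    return start, stop


start, stop = range_select(magnitudes, 13.4, 16.5)
selected = magnitudes[start:stop]
```

Each bound is found in a number of comparisons proportional to the logarithm of the table length, whereas without the sort-order metadata the reader would have to scan every row.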

One thrust of the VOTable document seems to be that the VOTable standard is a general mechanism for storing tabular and catalogue data in astronomy, not just (for example) for representing small tables extracted from a remote archive and transmitted across the internet prior to display. Thus, the VOTable standard should be suitable for storing large catalogues, where preserving information about sort order in order to facilitate fast `range' selections is important.

Obviously, the default, where no information about sort order is included in the catalogue metadata, should be that the catalogue is unordered.

Similarly, there appears to be no provision for storing indices on any of the columns, which again allows a program reading the catalogue to make fast range selections on indexed columns.

The RESOURCE element can contain several TABLEs, so I can't see any bar to including additional tables containing simple indices: lists of row numbers in the original table arranged in an order corresponding to a sort on some column (or to a selection of a subset, for that matter). An additional bit of syntax might be required to relate the indices to the original table.
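A simple index of this kind can be sketched in Python as follows. The column values and function name are hypothetical, and how such an index table would be tied to its parent table in VOTable syntax is left open, as noted above.

```python
from bisect import bisect_left, bisect_right

# Hypothetical unsorted column from the original table.
redshift = [0.82, 0.10, 0.45, 0.31, 0.67]

# The index: row numbers arranged in the order of a sort on the column.
index = sorted(range(len(redshift)), key=lambda row: redshift[row])


def rows_in_range(column, index, low, high):
    """Row numbers whose column value lies in [low, high], found via the index."""
    # Sorted view of the column; a real reader would keep this alongside
    # the index rather than rebuild it for every query.
    keys = [column[row] for row in index]
    return index[bisect_left(keys, low):bisect_right(keys, high)]


matching_rows = rows_in_range(redshift, index, 0.3, 0.7)
```

The original table is left in its stored order; only the list of row numbers is sorted, so several such indices (one per column) can coexist.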

Perhaps more complex (2D) indexing schemes can be deferred to version 2.

Use to describe data resources

Section 2 of the proposal says "a VOTable document may be used to express a question as well as an answer...the specification of class as an implicit request for instance."

I'll comment on using VOTable as a query format below, but here I suggest that VOTable is useful for describing data resources in detail. That is, one could use a collection of VOTable documents (or one such document with a large number of RESOURCE elements) in a resource directory to say exactly what queries are possible on various tables.

This usage is attractive because VOTable describes columns of tables in a generic way, and if we have software to handle that kind of metadata we might as well re-use the software for all cases where columns need to be described. It's not clear that VOTable is the best arrangement for a resource directory.

The VOTable material in the directory has to describe the external view of a table in a data-service. That is, the columns in the VOTable header may not be exactly the columns held in the database; there may well be translations going on as queries are accepted and results are returned. The translations need not be symmetric, i.e. the set of columns used in a query need not be the same as the set of columns that can be in the output. Therefore, to make VOTable into a generic representation of a tabular-data resource, the FIELD elements need some annotation saying in which cases they can be used. I suggested adding values to the set allowed for the type attribute: "in" for fields that can be used in a query; "out" for fields that can be used in results. For fields that can be used in both cases, the FIELD element is duplicated with different types on the two instances.

We would also need some agreement on what the units mean in a resource catalogue. Are the stated units what you get, with no choice in the matter? Or are they what you get unless you specify some conversion?

Use to define a query

This follows on from the use to describe a data resource. The procedure is logically as follows.

1. Copy the VOTable document describing the resource.

2. Delete zero or more of the fields of type "out", leaving the subset you want in the output. (This is equivalent to a SELECT clause in SQL).

3. Add extra FIELD elements of type "hidden" to express extra parameters of the query. Example: RA and Dec of search centre.

4. Adjust the units of the fields for the output to the values you prefer (clearly, this only works if the service supports unit conversion).

5. Add to the document constraints on the fields of type "in". The constraints are equivalent to a WHERE clause in SQL.
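The five steps can be sketched in Python with the standard xml.etree.ElementTree module. Everything specific here is an assumption made for illustration: the sample resource description, the "in"/"out"/"hidden" type values (which are the proposal made in these comments, not part of VOTable 0.4), the "value" attribute on hidden fields, and the WHERE and SQL element names (working titles discussed later in these comments).

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical resource description with duplicated in/out fields.
RESOURCE_XML = """
<TABLE name="galaxies">
  <FIELD name="r_mag" type="in"/>
  <FIELD name="z" type="in"/>
  <FIELD name="r_mag" type="out" unit="mag"/>
  <FIELD name="z" type="out"/>
  <FIELD name="ra" type="out" unit="deg"/>
</TABLE>
"""


def build_query(resource, keep_out, hidden, units, sql):
    """Derive a query document from a resource description."""
    query = copy.deepcopy(resource)                       # step 1: copy
    for field in list(query.findall("FIELD")):
        if field.get("type") != "out":
            continue
        name = field.get("name")
        if name not in keep_out:                          # step 2: delete
            query.remove(field)
        elif name in units:                               # step 4: units
            field.set("unit", units[name])
    for name, value in hidden.items():                    # step 3: hidden
        ET.SubElement(query, "FIELD",
                      {"name": name, "type": "hidden", "value": value})
    where = ET.SubElement(query, "WHERE")                 # step 5: constraints
    ET.SubElement(where, "SQL").text = sql
    return query


resource = ET.fromstring(RESOURCE_XML)
query = build_query(resource,
                    keep_out={"r_mag", "ra"},
                    hidden={"search_ra": "10.68"},
                    units={"ra": "rad"},
                    sql="r_mag > 18 OR z > 0.6")
out_names = [f.get("name") for f in query.findall("FIELD")
             if f.get("type") == "out"]
```

Note that step 2 only removes elements and never rewrites the type attribute, which is the property that duplicating the in/out fields is meant to preserve.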

The operation at step 2 involves only selection of fields, not modification, which I understand to be easier with XML tools. This is why I suggested that fields that can be used both for input and output should be duplicated instead of having type "inout". If the "inout" type were used, then some fields might need to be changed to type "in" in setting up the query.

Some extra element in VOTable is needed to select ordering of output rows. I suggest a SORTORDER element to be included as a child of TABLE.

The difficult bit is defining the syntax for the constraints, call it the WHERE element as a working title. I don't know exactly what the language for this should be, but I can give some ideas.

The WHERE element has to be some Boolean expression in which the operands are the names of fields. (Alternatively, UCDs could be used as operands, but why bother when VOTable goes out of its way to name the fields?)

The WHERE element has to be able to express any query on the given operands. This means firstly that it needs a rich set of operators and secondly that some standard, built-in functions (e.g. for spherical astronomy) are needed.

The WHERE element isn't a complete query in XQuery. The XQuery language seeks to define input and output structure in XML, and that's redundant for the VOTable case since we have standardized the output format and the inputs are relational, not hierarchical. In fact, most of XQuery is redundant for this usage.

The WHERE element isn't just ISO SQL. There are too many things you can't say in raw SQL (try expressing a constraint on distance from a point on a sphere, for example).
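As an illustration of the kind of built-in function meant here, the following Python sketch computes the angular separation of two points on the sphere with the haversine formula and uses it as a cone-search constraint. The function names are hypothetical and all angles are assumed to be in radians.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine formula); inputs in radians."""
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp against rounding error before taking the arcsine.
    return 2.0 * math.asin(min(1.0, math.sqrt(h)))


def within_cone(ra, dec, centre_ra, centre_dec, radius):
    """True if (ra, dec) lies within radius of the search centre."""
    return angular_separation(ra, dec, centre_ra, centre_dec) <= radius


quarter_turn = angular_separation(0.0, 0.0, math.pi / 2.0, 0.0)
inside = within_cone(0.01, 0.0, 0.0, 0.0, math.radians(1.0))
```

Writing the same constraint with only arithmetic comparisons on the RA and Dec columns is awkward and easy to get wrong near the poles, which is the argument for providing such functions as part of the constraint language.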

The WHERE element could allow SQL as an option. This covers queries that don't need advanced functions and exploits the fact that a lot of data services may have SQL engines attached. For example:

<WHERE>
<SQL>
r_mag > 18 OR z > 0.6
</SQL>
</WHERE>

(assuming that "r_mag" and "z" are the names of fields).

The WHERE clause could use just the where-clause syntax of XQuery. The advantages are (a) that that syntax has been well-vetted by specialists and is less likely to give subtle problems; and (b) that we might be able to scavenge some parsing code from XQuery implementations.

The WHERE clause could use the syntax defined for constraints in ASU. That syntax is rich in operators, but doesn't have specialised functions.

How would the query express a join between tables?

In general, having a small set of options for the language is good future proofing. The WHERE clause should have a single child-element which identifies the constraint language (like the SQL element in the example above).
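A reader of such a WHERE element might identify the constraint language as in the following Python sketch. Only the SQL child element comes from the example earlier in these comments; the XQUERY and ASU tag names, and the function name, are hypothetical.

```python
import xml.etree.ElementTree as ET

# SQL is from the example above; the other tag names are assumed.
SUPPORTED_LANGUAGES = {"SQL", "XQUERY", "ASU"}


def read_constraint(where_xml):
    """Return (language, expression) from a WHERE element with one child."""
    where = ET.fromstring(where_xml)
    children = list(where)
    if where.tag != "WHERE" or len(children) != 1:
        raise ValueError("expected a WHERE element with exactly one child")
    child = children[0]
    if child.tag not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported constraint language: " + child.tag)
    return child.tag, (child.text or "").strip()


language, expression = read_constraint(
    "<WHERE><SQL>r_mag &gt; 18 OR z &gt; 0.6</SQL></WHERE>"
)
```

A service that does not support a given language can reject the query on the tag name alone, without attempting to parse the expression.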

CSV

As written the example and explanation are a bit confusing. I think that what is intended (and if not it should be) is that:

These rules apply throughout the CSV, including any header lines (the number of which is indicated by headlines), which are skipped over.

(The present text implies that any header lines to be skipped over must end in \n, irrespective of the row separator used for the rest of the table, which seems perverse. Of course, in most tables \n will be the row separator.)

The above comments for TABLEDATA about unparsable values and empty cells (here adjacent occurrences of the separator character) also apply.

- It is not entirely clear that enclosing quotation marks should automatically be ignored and removed (if they're just going to be ignored then why are they there?). Which quotation marks: single quotes, double quotes or both?

 

Clarifications of Meanings

Paths

In section 4.3 (and 6) the syntax href=file://mydata.dat/ is described. As an aside: it is not clear to me in what circumstances quotes are needed around the argument of the href attribute. More importantly, I think it will be important to have a notation both for absolute and relative paths: the example shown looks like an absolute path, but I'm not sure. The use of relative paths is certainly convenient in web pages, as it means that links to dependent files, such as GIF images, in the same directory retain their validity even if the whole collection of files is copied elsewhere. But there may be circumstances when an absolute file path is needed.
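The distinction can be sketched in Python with the standard urllib.parse module. The location of the VOTable document and the file names are hypothetical; the sketch only assumes that href values follow generic URL rules, which the draft does not state.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical location of the VOTable document itself.
document_url = "file:///data/catalogues/table.xml"

# A relative reference is resolved against the document's own location.
relative = urljoin(document_url, "mydata.dat")

# An absolute reference ignores the document's location.
absolute = urljoin(document_url, "file:///archive/mydata.dat")

# Under generic URL syntax the text after "//" is a host name, so the
# draft's example names a host "mydata.dat" with path "/".
parts = urlsplit("file://mydata.dat/")
```

If the document and its data file are moved together, the relative form still resolves correctly, while the absolute form keeps pointing at the old location. The last line also suggests that the draft's example may not mean what was intended if it is read as a generic URL.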

MAX and MIN

We assume that MAX/MIN refer to the maximum/minimum allowed values rather than the actual max/min values occurring in the table, but this is not clear from the text. This functionality is available in XML Schema.

Editorial Comments

A tree-diagram (for example the usual XMLSpy-type diagram) would be useful to show the structure of the table.

Data should be plural.

Use of a phrase like "will cause an exception to be thrown" (section 3.4) is rather out of place in the description of a format. The VOTable standard should not prescribe how a program reading a VOTable should behave when it encounters an invalid table; its behaviour will depend, in part, on its function and circumstances. It might indeed throw an exception and abort, or it might issue a warning, attempt a guess at the missing datatype and carry on, or even, in some circumstances, attempt to carry on without issuing a warning. It is for the program's designers, not the VOTable standard, to decide what behaviour is appropriate.

The version of the encoding such as gzip should perhaps be specified.