From Vistas in Astronomy, 40, 395 (1996)

Chapter 1
Neural Networks and Information Extraction in Astronomical Information Retrieval

S. Lesteven

Observatoire Astronomique, Strasbourg, France

P. Poinçot

Observatoire Astronomique, Strasbourg, France

F. Murtagh

University of Ulster, Magee College, Northern Ireland

We describe results obtained in prototyping a Kohonen self-organising
feature map as a user interface to a large document collection.  
We then discuss the provision 
of various tools which will be used in quality control of databases and 
catalogs produced from, and cross-referenced with, the online astronomical
literature; and which will be used in future information seeking and 
processing intelligent agent technologies. 

1.1  Introduction

We firstly examine a Kohonen self-organizing feature map (SOFM) interface to large document collections. Low-dimensional representations of documents or other related objects have been used for a long time. Factor space (Ossorio, 1966) and latent semantic indexing (Deerwester et al., 1990) are two examples. The Kohonen map is a method with similar objectives. It offers the advantage of presenting results in a display-friendly manner (a point made in Lin et al., 1991, and Murtagh and Hernández-Pajares, 1995).

Secondly, we look at perspectives for information extraction in astronomy. Astronomy offers a relatively well-demarcated domain and set of themes which have already been comprehensively indexed and cross-linked albeit manually. Compared to other domains (Lewis and Jones, 1995), astronomical texts offer particular challenges but user needs are often more clearcut. Thus astronomical object names, wavelength ranges, instruments and observing sites are all topics which are relatively specific and which serve to characterize the text under consideration. Template-driven information extraction (Cowie and Lehnert, 1993) along these lines is looked at below.

1.2  Letting a Neural Net Organise the Astronomical Literature

In its most widely used incarnation (see Murtagh and Hernández-Pajares, 1995), the Kohonen self-organizing feature map (SOFM), with a regular grid as output representational or display space, involves determining vectors $w_k$ such that inputs $x_i$ are parsimoniously summarized (clustering objective); in addition, the vectors $w_k$ are positioned in representational space so that similar vectors are close (low-dimensional projection objective). Here $k, k', k'' \in \{ (r,s) \mid r = 1, \ldots, R; \; s = 1, \ldots, S \}$.

Clustering: associate each $x_i$ with some one $w_k$ such that
$k = \arg\min_{k'} \| x_i - w_{k'} \|$

Low-dimensional projection:
$\| w_k - w_{k'} \| < \| w_k - w_{k''} \| \implies \| k - k' \| \le \| k - k'' \|$

By way of example, $R = S = 10$ and the output representation grid is a regular, square one. The metric chosen for the norm $\| \cdot \|$ is usually Euclidean. We have $x_i, w_k \in \mathbb{R}^m$ for some dimensionality, or cardinality of attribute-set, $m$.

Iterative algorithms for clustering are widely used, requiring an initial random choice of values for $w_k$ to be updated on the basis of presentation of input vectors, $x_i$. In the SOFM method, at each such update the low-dimensional projection objective is catered for by updating not just the so-called winner $w_k$, but also neighbors of $w_k$ with respect to the representational space. The neighborhood is initially chosen to be quite large (e.g. a 4 × 4 zone) and, as the epochs proceed, is reduced to 1 × 1 (i.e. no neighborhood). An epoch is the term used for a complete set of presentations, and consequent updates, of the N input vectors. The result obtained by the SOFM algorithm is sub-optimal, as is usually also the case for clustering algorithms of this sort (k-means, partitioning) and quite often for dimensionality-reduction methods (Sammon's mapping). A range of studies showing how well the SOFM method performs compared to these methods can be found in Murtagh and Hernández-Pajares (1995).
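The update scheme just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used for the results reported here: the function names, the linear schedules for learning rate and neighborhood radius, and the default parameter values are all assumptions. A square zone of half-width `radius` around the winner stands in for the neighborhood, shrinking to the winner alone in the final epochs.

```python
import math
import random


def winner(weights, x):
    """Grid position k minimising the Euclidean distance ||x - w_k||."""
    return min(weights, key=lambda k: math.dist(x, weights[k]))


def train_sofm(inputs, rows, cols, epochs=20, radius0=2, rate0=0.5, seed=0):
    """Train a small SOFM on a rows x cols grid; returns {(r, s): w_k}."""
    rng = random.Random(seed)
    m = len(inputs[0])
    # Initial random choice of values for the vectors w_k.
    weights = {(r, s): [rng.random() for _ in range(m)]
               for r in range(rows) for s in range(cols)}
    for epoch in range(epochs):
        # Radius and learning rate shrink as the epochs proceed
        # (linear schedules are an assumption of this sketch).
        frac = epoch / epochs
        radius = round(radius0 * (1 - frac))
        rate = rate0 * (1 - frac)
        for x in inputs:  # one epoch = one presentation of all N inputs
            win = winner(weights, x)
            for (r, s), w in weights.items():
                # Update the winner and its neighbors on the grid.
                if abs(r - win[0]) <= radius and abs(s - win[1]) <= radius:
                    for j in range(m):
                        w[j] += rate * (x[j] - w[j])
    return weights
```

After training, `winner(weights, x)` gives the grid cell at which a document vector `x` would be displayed.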

An SOFM map of a particular domain has been used by Kohonen and others to produce a semantic interpretation of that domain, thereby facilitating information retrieval (Honkela et al., 1995; Lin et al., 1991; Honkela et al., 1996). The WEBSOM site (Kohonen et al., 1996) contains an example of a zoomable map relating to news items distributed in the newsgroup.

In our work, 2063 bibliographic references from 1989 were used, and a semantic map constructed based on the associated set of 463 index terms. The latter were NASA-STI keywords. Further details of this data may be found in Lesteven (1994, 1995).

An SOFM map of dimensions 20 × 16 was used and is shown in Fig. 1.

Figure 1: Kohonen SOFM interface to astronomical literature.

The Kohonen map was programmed for this task (alternatives include the SNNS package, Zell 1993, which was also appraised; code in Murtagh and Hernández-Pajares; and Kohonen's SOM_PAK package). Attention was devoted to learning convergence and stability issues, especially since the input data was high-dimensional (463-valued) and near-binary. Based on predominant keywords, the nodes of the resulting map were characterized. Currently the map is not zoomable (cf. Kohonen et al., 1996), nor hierarchic, and consequently the navigation aids available in Kohonen et al. (1996) are not implemented. Computational time for training is lengthy (an hour or two), but not overly so. In fact, there are possibilities for making use of the very sparse nature of the input data for algorithmic improvement. Average computational complexity of similarity calculation using inverted lists is discussed in Murtagh (1985). Another approach is hash coding, referred to in Honkela et al. (1996).
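One simple way in which sparsity might be exploited is sketched below. It is neither the inverted-list analysis of Murtagh (1985) nor hash coding, and it assumes strictly binary inputs, whereas the data used here were only near-binary. For a binary vector $x$ with set of non-zero positions $T$, $\| x - w \|^2 = \| w \|^2 - 2 \sum_{j \in T} w_j + |T|$, so only the few keywords present in a document need be visited once $\| w \|^2$ is known. The function names and the toy values are hypothetical.

```python
def sparse_sq_dist(active, w, w_sq_norm):
    """Squared Euclidean distance between a binary vector and w.

    `active` lists the positions where the binary input equals 1, and
    `w_sq_norm` is the precomputed value of ||w||^2.
    """
    return w_sq_norm - 2.0 * sum(w[j] for j in active) + len(active)


def dense_sq_dist(x, w):
    """Reference computation over all m components."""
    return sum((xj - wj) ** 2 for xj, wj in zip(x, w))


# Toy example with m = 5 index terms, two of which are present.
w = [0.2, 0.9, 0.0, 0.4, 0.7]
x = [0, 1, 0, 0, 1]
active = [j for j, xj in enumerate(x) if xj]
w_sq_norm = sum(wj * wj for wj in w)
```

In a full implementation $\| w \|^2$ would have to be refreshed whenever a node is updated, so the saving applies mainly to the search for the winner.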

This SOFM map is not yet publicly accessible but it is intended that it will be soon. We will seek to extend the data handled to years other than 1989. Another keyword system, e.g. that of ADS (Astrophysical Data System) or that of Astronomy and Astrophysics, may be better, and this will have to be investigated. The evolution of the literature and of astronomical research topics, over time, will need to be catered for. It may be adequate to update the Kohonen maps at regular intervals, but if so the user will lose the familiarity gained with the previous map.

1.3  Astronomical Databases and Catalog Creation through Information Extraction

Relatively simple rules for information extraction may work well, as is evidenced by Hayes (1994) and by Cowie and Lehnert (1993).

In SIMBAD, object names are currently found in bibliographic documents by human compilers of catalog information subsequently associated with the documents. We are using the standard Unix lexical analyzer, Lex, to determine object names automatically. Lex generates programs to carry out lexical analysis, partitioning the input text into meaningful tokens or lexemes. Regular expressions control the pattern matching. Rules are specified for object names such as NGC 206-231, HD 106225 and M31. An initial set of rules led to a success rate of around 80% correctly found object names. A small set of titles, keywords and abstracts was used for these tests. False detections include partial object names; names of object classes such as RR Lyr; and instrument or project names such as HEAO 1 or HEAO-1. Further refinement of the rules applied will be carried out.
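The flavor of such rules can be conveyed by a short sketch. It uses Python's `re` module in place of Lex, and the three patterns are hypothetical illustrations built from the example names above, not the rule set actually tested. Restricting the match to known catalog prefixes is one way of keeping names such as HEAO 1 out of the result.

```python
import re

# Hypothetical rules for three catalog prefixes; not the Lex rules themselves.
OBJECT_NAME = re.compile(
    r"\b(?:"
    r"NGC\s?\d{1,4}(?:-\d{1,4})?"   # NGC 206, or a range such as NGC 206-231
    r"|HD\s?\d{1,6}"                # HD 106225
    r"|M\s?\d{1,3}"                 # M31 or M 31
    r")\b"
)


def find_object_names(text):
    """Return catalog-style object names in order of appearance."""
    return OBJECT_NAME.findall(text)


sample = "Spectra of HD 106225 and of NGC 206-231 in M31 were taken with HEAO 1."
names = find_object_names(sample)
print(names)  # ['HD 106225', 'NGC 206-231', 'M31']
```

Rules of this kind would still miss names outside the listed catalogs, and would not by themselves separate object classes such as RR Lyr from individual objects.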

Names of space missions, ground-based observing facilities, telescopes and detectors (ISO, ESO 3.6m, Ronchi grating, EFOSC/PUMA 2, etc.) are also feasible candidates for pattern matching. Lists of terms can be assembled. In the longer term, ambivalence and multiple meanings may be problematic (Heck, 1995).

T: The Infrared Space Observatory
A: The development of the Infrared Space Observatory (ISO), which is planned to be launched into an elliptical orbit by the ESA in 1992 or 1993, is discussed. The characteristics of the scientific instruments and the satellite's payload module are described and the ISO observing times and flux detection capabilities are presented. The ISO plans include a cryogenically cooled telescope with an effective aperture of 60 cm, four complementary focal plane instruments, a 3-17-micron camera, a photopolarimeter (3-200 microns), and two spectrometers covering the range of 3-180 microns.

T: Ultraviolet observations of starburst systems with Astro and the Hubble Space
A: The capabilities of the Astro observatory and the Hubble Space Telescope (HST) for UV astronomical observations are discussed. The advantages of using UV for studying extragalactic problems are reviewed, including sensitivity to hot sources, the rich UV line spectra, sensitivity to interstellar dust, and dark UV sky background. The HST and Astro instrumentation and plans for launching the systems are described. Possibile uses of HST and Astro include the study of massive stars in the Galaxy and nearby galaxies, interstellar medium, the UV excess component in old stellar populations, deep UV surveys (filter imaging and slitless spectroscopy), active galactic nuclei, quasars, and starburst systems.

T: AXAF and XMM - The X-ray observatories for the 1990s
A: Plans for NASA's Advanced X-ray Astronomy Facility (AXAF) and ESA's X-ray Multi-Mirror Mission (XMM) are discussed, stressing their potential use for the study of galaxies. The AXAF is a space observatory with an expected lifetime of over 15 yr, with periodic refurbishment by the Space Station. The AXAF plans include a CCD imaging spectrometer, high resolution camera, Bragg crystal spectrometer, X-ray calorimeter, and low and high energy transmission gratings. The XMM is planned for launch in 1998 by Ariane IV into a deep-earth orbit with a payload containing 4 telescope CCD arrays and imaging gas scintillation proportional counters. Two of these telescopes are expected to contain reflection gratings and Bragg crystal spectrometers. The two systems are compared, and the spectroscopic studies of galaxies which the systems are expected to provide are exami

Sample texts.

Other types of information can also be sought, though with greater difficulty. These include astronomical coordinates or other positional information, if present, and the wavelengths or wavelength interval of study.
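As an indication of what a wavelength rule might look like, the sketch below picks out ranges expressed in microns from a fragment of the first sample abstract. The pattern and function name are hypothetical, and only this one unit and this one way of writing a range are handled; a usable rule set would need many more forms (angstroms, keV, named bands and so on).

```python
import re

# Matches ranges written as "3-200 microns" or "3-17-micron"; restricting the
# unit to microns is purely for illustration.
WAVELENGTH_RANGE = re.compile(
    r"\b(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)[ -]microns?\b"
)


def find_micron_ranges(text):
    """Return (low, high) wavelength bounds in microns as floats."""
    return [(float(lo), float(hi)) for lo, hi in WAVELENGTH_RANGE.findall(text)]


abstract = ("four complementary focal plane instruments, a 3-17-micron camera, "
            "a photopolarimeter (3-200 microns), and two spectrometers "
            "covering the range of 3-180 microns.")
ranges = find_micron_ranges(abstract)
print(ranges)  # [(3.0, 17.0), (3.0, 200.0), (3.0, 180.0)]
```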

Earlier work used terms derived from the IAU Thesaurus as a controlled vocabulary to search astronomical texts; see Murtagh (1996). A set of around 1500 Hubble Space Telescope observing proposal abstracts was assessed using an SOFM map. While this was successful, much useful information in the IAU Thesaurus ("see also" links and hierarchical structure, for instance) was not availed of.

1.4  Conclusion

We have started investigation of a novel user interface mechanism for astronomical documentation and textual data, which is easy to use and may offer added robustness vis-à-vis data from many different sources. This mechanism is based on a cartographic representation. Among the open problems is the integration of supervised and unsupervised modes of operation (the latter alone being currently used).

Automating the task of information extraction will aid enormously in the quality control of large documentation-based data collections. Information extraction toolsets will also provide the basis for the work of intelligent information gathering agents which will help the researcher in his/her work.

1.5  References