Hypermedia Research Unit


XML / RDF Thesaurus Formats

Introduction

A variety of formats have been developed over the years for the storage and interchange of thesaurus data. In many cases perceived over-complexity can be a stumbling block to widespread adoption. Knowledge organisation systems such as thesauri can become complex, but the basic format used to represent them need not be. We describe here two lightweight and flexible XML formats which have been developed and used within the University of Glamorgan for the purposes of representing and storing thesaurus data in a simple and readily usable manner.

The use of XPATH querying techniques is also demonstrated using a series of small example queries. The examples are not presented as a universal approach to searching and browsing online knowledge organisation systems, but as demonstrations of some useful XPATH techniques that may be employed on relatively small datasets. Various schema files, demonstration data files and usage examples are included which may be viewed online, downloaded, adapted or extended as necessary to suit your own work. The formats described facilitate the development and viewing of complete thesauri using a standard text editor and a web browser.

In section 3 we briefly describe the Simple Knowledge Organisation System (SKOS) RDF format and demonstrate the application of similar XPATH querying techniques in the context of this format.

1. Standalone Thesaurus Format

This XML format allows the modelling of a complete thesaurus within a single file, using familiar tag names based on thesaurus standards. The id for the term may be a numeric or textual unique identifier; in some cases it may just be the term itself. The schema representing this format is available for download / viewing here.

Extract from st_rock.xml:

<?xml version="1.0"?>
<terms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="./st.xsd">
  .
  .
  <term id="t8">
    <label>slate</label>
    <bt idref="t3"/>
    <uf idref="t13"/>
    <sn>Fine grained metamorphic rock easily split into smooth, flat plates</sn>
  </term>
  .
  .
  <term id="t13">
    <label>slatestone</label>
    <use idref="t8"/>
  </term>
  .
  .
</terms>
View the entire file
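
In the extract above the uf link from 'slate' to 'slatestone' is mirrored by a use link in the opposite direction. The following is a minimal Python sketch, using only the standard library, of how one might check that every idref resolves to a term in the same file and that uf / use links come in reciprocal pairs. Whether the schema itself requires this reciprocity is an assumption on our part. The function name check_links is hypothetical, and the stub entry for t3 (labelled 'metamorphic rock', as implied by the hierarchy in section 1.1) has been added so that the bt link resolves.

```python
import xml.etree.ElementTree as ET

# The extract from st_rock.xml with the elided terms left out.
# The stub for t3 is an assumption added so the bt link from slate resolves.
ST_EXTRACT = """<terms>
  <term id="t3">
    <label>metamorphic rock</label>
  </term>
  <term id="t8">
    <label>slate</label>
    <bt idref="t3"/>
    <uf idref="t13"/>
    <sn>Fine grained metamorphic rock easily split into smooth, flat plates</sn>
  </term>
  <term id="t13">
    <label>slatestone</label>
    <use idref="t8"/>
  </term>
</terms>"""


def check_links(root):
    """Return a list of problems: dangling idrefs and unreciprocated uf/use links."""
    terms = {t.get("id"): t for t in root.iter("term")}
    problems = []
    for term_id, term in terms.items():
        for link in term:
            target = link.get("idref")
            if target is None:
                continue  # label, sn and any other non-link element
            if target not in terms:
                problems.append(f"{term_id}: {link.tag} points at unknown id {target}")
                continue
            # A uf link should be mirrored by a use link on the target, and vice versa.
            inverse = {"uf": "use", "use": "uf"}.get(link.tag)
            if inverse is not None:
                back = [e.get("idref") for e in terms[target].findall(inverse)]
                if term_id not in back:
                    problems.append(f"{term_id}: {link.tag} to {target} has no matching {inverse}")
    return problems


root = ET.fromstring(ST_EXTRACT)
print(check_links(root))  # [] for the extract above
```

A check of this kind is cheap to run after hand-editing a file in a text editor, which is the working style the format is intended to support.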

1.1 Demonstration Data

For demonstration purposes we use the following small hierarchical thesaurus of rock materials to visualise the syntax for the various data links (the format does also support poly-hierarchical links).

rock
. igneous rock
. . basalt         [UF --> lapis basamites]
. . granite        [UF --> moorstone]
. metamorphic rock
. . marble         [RT --> limestone]
. . slate          [UF --> slatestone]
. sedimentary rock
. . limestone      [RT --> marble]
. . sandstone      [UF --> arenaceous rock]

A number of small demonstration files in ST format are available for downloading/viewing:

  • Generic example - a short example showing basic link types and XML syntax (XSLT formatted version)
  • AAT Extract - a very short extract from an old version of the Getty Art & Architecture Thesaurus (XSLT formatted version)
  • Rock types - the short hierarchy as described above (XSLT formatted version)
  • ADL FTT - Alexandria Digital Library Feature Type Thesaurus (XSLT formatted version) [131KB]. This data is based on the 03/07/2002 version of the Feature Type Thesaurus; however, the following modifications were made for our own purposes:
    • We have extracted only the information we needed, therefore input/update dates and definition notes are not present in the files.
    • The term 'commonwealths' was a non-preferred term with an associated scope note. For our purposes we have removed the scope note, to make the data conform to our schema.
    • The term 'land parcels' was a preferred term with no broader term or narrower term relationships. For our purposes we have made it a narrower term of 'administrative areas', to make the data conform to our schema.
    The reader is therefore directed to the ADL website for the most complete version of the data for this thesaurus.

1.2 Use of XPATH Expressions

The standalone thesaurus format allows for the use of XPATH commands to directly locate data within the document. The following example expressions have all been tested against the rock types thesaurus described previously.

XPATH Expression                            Description                                    Results
//term[bt/@idref='t3']/label                Locate all NT links for term id 't3'           marble, slate
//term[not(use)]/label[starts-with(.,'s')]  Find PREFERRED terms starting with 's'         sedimentary rock, slate, sandstone
//term/label[contains(.,'stone')]           Find ANY terms containing 'stone'              limestone, sandstone, moorstone, slatestone
count(//term/use)                           Count the NON-PREFERRED terms in the document  4
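
For readers without an XPATH processor to hand, the four queries above can be approximated in Python. This is a sketch under stated assumptions rather than the code behind our demonstrator: the standard library's xml.etree.ElementTree supports only a subset of XPATH (no not(), starts-with(), contains() or count()), so the predicates are written as ordinary Python filters. The data is a compact rendering of the rock types hierarchy from section 1.1. Only the ids t3, t8 and t13 are taken from the extract; the remaining ids are assumed for illustration, and RT links are omitted because the queries do not use them.

```python
import xml.etree.ElementTree as ET

# Compact rendering of the rock types thesaurus.
# Ids other than t3, t8 and t13 are assumptions made for this sketch.
ROCKS = """<terms>
  <term id="t1"><label>rock</label></term>
  <term id="t2"><label>igneous rock</label><bt idref="t1"/></term>
  <term id="t3"><label>metamorphic rock</label><bt idref="t1"/></term>
  <term id="t4"><label>sedimentary rock</label><bt idref="t1"/></term>
  <term id="t5"><label>basalt</label><bt idref="t2"/><uf idref="t11"/></term>
  <term id="t6"><label>granite</label><bt idref="t2"/><uf idref="t12"/></term>
  <term id="t7"><label>marble</label><bt idref="t3"/></term>
  <term id="t8"><label>slate</label><bt idref="t3"/><uf idref="t13"/></term>
  <term id="t9"><label>limestone</label><bt idref="t4"/></term>
  <term id="t10"><label>sandstone</label><bt idref="t4"/><uf idref="t14"/></term>
  <term id="t11"><label>lapis basamites</label><use idref="t5"/></term>
  <term id="t12"><label>moorstone</label><use idref="t6"/></term>
  <term id="t13"><label>slatestone</label><use idref="t8"/></term>
  <term id="t14"><label>arenaceous rock</label><use idref="t10"/></term>
</terms>"""

root = ET.fromstring(ROCKS)
terms = root.findall("term")


def narrower_labels(term_id):
    """Equivalent of //term[bt/@idref='ID']/label."""
    return [t.findtext("label") for t in terms
            if any(bt.get("idref") == term_id for bt in t.findall("bt"))]


def preferred_starting_with(prefix):
    """Equivalent of //term[not(use)]/label[starts-with(.,'PREFIX')]."""
    return [t.findtext("label") for t in terms
            if t.find("use") is None and t.findtext("label").startswith(prefix)]


def labels_containing(text):
    """Equivalent of //term/label[contains(.,'TEXT')]."""
    return [t.findtext("label") for t in terms if text in t.findtext("label")]


def count_non_preferred():
    """Equivalent of count(//term/use)."""
    return len(root.findall("term/use"))


print(narrower_labels("t3"))         # ['marble', 'slate']
print(preferred_starting_with("s"))  # ['sedimentary rock', 'slate', 'sandstone']
print(labels_containing("stone"))    # ['limestone', 'sandstone', 'moorstone', 'slatestone']
print(count_non_preferred())         # 4
```

Note that narrower terms are found by inverting the bt links, exactly as the first XPATH expression does, so the sketch does not depend on explicit narrower term links being stored in the file.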

These and other commands are used in an online search demonstrator showing how the thesaurus structure modelled by the XML format may be visually represented and traversed. The demonstrator performs a search on the ADL Feature Type Thesaurus using XPATH expressions and terms may be clicked on to view and dynamically traverse the hierarchical structure.

1.3 Use of XSL transformations

XSL transformation may be used to present the data contained in the document in various forms. Our transformation example creates a traditional alphabetical listing of terms, incorporating active hyperlinks allowing the user to dynamically navigate the document simply by clicking on terms. The XSL file used to create this transformation is available for viewing / download here.
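
To illustrate the kind of output such a transformation produces, here is a minimal sketch of an alphabetical listing. It is written in Python rather than XSL and produces plain text without hyperlinks, so it is an approximation of the idea and not our stylesheet. The function name alphabetical_listing is hypothetical, the data is the extract from section 1 with an assumed stub for t3, and NT entries are derived by inverting bt links.

```python
import xml.etree.ElementTree as ET

# Extract from section 1; the stub for t3 is an assumption.
SAMPLE = """<terms>
  <term id="t3"><label>metamorphic rock</label></term>
  <term id="t8"><label>slate</label><bt idref="t3"/><uf idref="t13"/>
    <sn>Fine grained metamorphic rock easily split into smooth, flat plates</sn></term>
  <term id="t13"><label>slatestone</label><use idref="t8"/></term>
</terms>"""


def alphabetical_listing(root):
    """Return a plain-text alphabetical listing of all terms and their links."""
    terms = {t.get("id"): t for t in root.iter("term")}
    lines = []
    for term in sorted(terms.values(), key=lambda t: t.findtext("label").lower()):
        lines.append(term.findtext("label"))
        note = term.findtext("sn")
        if note:
            lines.append(f"  SN   {note}")
        # Links stored on the term itself, resolved to the target's label.
        for tag in ("use", "uf", "bt"):
            for link in term.findall(tag):
                target = terms[link.get("idref")]
                lines.append(f"  {tag.upper():<4} {target.findtext('label')}")
        # NT entries are not stored; derive them from bt links pointing here.
        for other in terms.values():
            if any(bt.get("idref") == term.get("id") for bt in other.findall("bt")):
                lines.append(f"  NT   {other.findtext('label')}")
    return "\n".join(lines)


print(alphabetical_listing(ET.fromstring(SAMPLE)))
```

In an XSL version each resolved label would be emitted as a hyperlink to the target term's entry, which is what allows navigation by clicking on terms.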

1.4 Alternative version of the standalone format

An experimental alternative version of the standalone thesaurus format was also developed, incorporating non-preferred terms as sub-properties of concepts rather than as freestanding terms in their own right. This resulted in a simpler schema for the data. The schema describing the alternative format is available here, and the previously described rock materials thesaurus formatted using the alternative format is available here. Admittedly, as a preliminary experimental schema there is room for improvement, but as a basic extensible tagged data format for representing thesaurus data it is perfectly adequate.

2. Distributed Thesaurus Format

Based on elements from the standalone thesaurus, this experimental XML format is intended to facilitate easy collaborative development. Thesaurus data is split into small files (one file per term) which may be distributed across remote sites, allowing multiple institutions to work independently on individual hierarchy fragments covering the terminology within their own area of expertise. The format employs common web hyperlinks to seamlessly merge the distributed body of work for display and navigation purposes. The schema representing this format is available for viewing / download here.

slate.xml:

<?xml version="1.0"?>
<term xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="http://www.comp.glam.ac.uk/~FACET/formats/dt/dt.xsd">
  <label>slate</label>
  <bt href="metamorphic.xml"/>
  <uf href="slatestone.xml"/>
  <sn>fine grained metamorphic rock easily split into smooth, flat plates</sn>
</term>

slatestone.xml:

<?xml version="1.0"?>
<term xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="http://www.comp.glam.ac.uk/~FACET/formats/dt/dt.xsd">
  <label>slatestone</label>
  <use href="slate.xml"/>
</term>

Here is an example of the distributed thesaurus, showing how it is possible to navigate a series of data files located at multiple sites. The XSL file used to create this interface is available for viewing / download here.
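
The way links are followed between term files can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the XSL interface itself: an in-memory dictionary stands in for files that would normally be fetched over HTTP, the base URL is hypothetical, and the content of metamorphic.xml (which is referenced by slate.xml but not shown above) is assumed. Relative href values are resolved against the URL of the file that contains them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://thesaurus.example.org/rocks/"  # hypothetical location

# In-memory stand-in for term files held on one or more web sites.
FILES = {
    BASE + "slate.xml": """<term>
  <label>slate</label>
  <bt href="metamorphic.xml"/>
  <uf href="slatestone.xml"/>
  <sn>fine grained metamorphic rock easily split into smooth, flat plates</sn>
</term>""",
    BASE + "slatestone.xml": """<term>
  <label>slatestone</label>
  <use href="slate.xml"/>
</term>""",
    # Assumed content: this file is referenced above but not shown.
    BASE + "metamorphic.xml": """<term>
  <label>metamorphic rock</label>
</term>""",
}


def load_term(url):
    """Parse one term file. A real client would fetch the URL instead."""
    return ET.fromstring(FILES[url])


def linked_labels(url, tag):
    """Follow every link of the given type from the term at url and return the labels."""
    term = load_term(url)
    labels = []
    for link in term.findall(tag):
        target = urljoin(url, link.get("href"))  # resolve relative href
        labels.append(load_term(target).findtext("label"))
    return labels


slate = BASE + "slate.xml"
print(linked_labels(slate, "bt"))                      # ['metamorphic rock']
print(linked_labels(slate, "uf"))                      # ['slatestone']
print(linked_labels(BASE + "slatestone.xml", "use"))   # ['slate']
```

Because each href is resolved against the referencing file's own URL, a term file could equally carry an absolute href pointing at a file maintained by another institution.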

3. Simple Knowledge Organisation System (SKOS) RDF format

SKOS is an open collaboration initiated by SWAD-Europe, developing specifications and standards to support the use of knowledge organisation systems (KOS) on the semantic web. See the SKOS homepage for more information on this format. We have experimented with the ADL Feature Type Thesaurus data, this time using a custom XSL transformation to convert the XML data to SKOS RDF format. The SKOS format RDF file is available for viewing here.

Extract from skos_FTT.rdf:

<skos:Concept rdf:nodeID="FTT203">
	<skos:inScheme rdf:nodeID="ADL-FTT"/>
	<skos:prefLabel>reefs</skos:prefLabel>
	<skos:altLabel>barrier reefs</skos:altLabel>
	<skos:altLabel>fringing reefs</skos:altLabel>
	<skos:broader rdf:nodeID="FTT96"/>
	<skos:narrower rdf:nodeID="FTT547"/>
	<skos:related rdf:nodeID="FTT206"/>
	<skos:related rdf:nodeID="FTT5"/>
</skos:Concept>

An online search demonstrator similar to that produced for the standalone thesaurus format described earlier is available for experimentation. This demonstrator allows selection from the following SKOS format thesauri:

  • Feature Type Thesaurus - the data file as described in this section.
  • GCL: Government Category List (Version 2.1). This file, generated by Alistair Miles, was downloaded from http://isegserv.itd.rl.ac.uk/skos/gcl/ for experimentation purposes only, and the reader is directed to the GOVTALK website for the most current dataset.
  • APAIS: Australian Public Affairs Information Service (April 2004 edition). This file, generated by Alistair Miles, was downloaded from http://isegserv.itd.rl.ac.uk/skos/apais/ for experimentation purposes only, and the reader is directed to the APAIS website for the most current dataset.
  • GSAFD: Guidelines on Subject Access to Individual Works of Fiction, Drama etc. (2nd edition, 20/07/2004). The SKOS format version of this data file was downloaded from http://www.oclc.org/research/projects/termservices/resources/gsafd.htm for experimentation purposes only.

3.1 Use of XPATH Expressions

The demonstrator again uses simple XPATH expressions to query the data, and terms may be clicked on to view and dynamically traverse the hierarchical structure. The following examples were run against the Feature Type Thesaurus data set:

XPATH Expression                                                  Description                                Results
//skos:Concept[skos:broader/@rdf:nodeID='FTT203']/skos:prefLabel  Locate all NT labels for term id 'FTT203'  coral reefs
//skos:Concept/skos:prefLabel[starts-with(.,'v')]                 Find PREFERRED terms starting with 'v'     valleys, viewing locations, volcanic features, volcanoes
(//skos:prefLabel | //skos:altLabel)[contains(.,'reef')]          Find ANY terms containing 'reef'           coral reefs, reefs, barrier reefs, fringing reefs
count(//skos:Concept)                                             Count the concepts in the document         210
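
The main practical difference when querying the SKOS file is that element and attribute names are namespace-qualified. The following Python sketch shows one way of handling this with the standard library, again writing the predicates as Python filters because xml.etree.ElementTree supports only a subset of XPATH. Several things here are assumptions: the namespace URIs (the extract does not show its declarations, so the usual RDF and SKOS Core URIs are used), the enclosing rdf:RDF element, and the 'coral reefs' concept, whose id FTT547 is inferred from the skos:narrower link in the extract and the first query result above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the extract does not show its declarations.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
NODE_ID = "{%s}nodeID" % NS["rdf"]
PREF = "{%s}prefLabel" % NS["skos"]
ALT = "{%s}altLabel" % NS["skos"]

# The 'reefs' concept is from the extract; the 'coral reefs' concept is inferred.
SKOS_SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:nodeID="FTT547">
    <skos:inScheme rdf:nodeID="ADL-FTT"/>
    <skos:prefLabel>coral reefs</skos:prefLabel>
    <skos:broader rdf:nodeID="FTT203"/>
  </skos:Concept>
  <skos:Concept rdf:nodeID="FTT203">
    <skos:inScheme rdf:nodeID="ADL-FTT"/>
    <skos:prefLabel>reefs</skos:prefLabel>
    <skos:altLabel>barrier reefs</skos:altLabel>
    <skos:altLabel>fringing reefs</skos:altLabel>
    <skos:broader rdf:nodeID="FTT96"/>
    <skos:narrower rdf:nodeID="FTT547"/>
    <skos:related rdf:nodeID="FTT206"/>
    <skos:related rdf:nodeID="FTT5"/>
  </skos:Concept>
</rdf:RDF>"""

root = ET.fromstring(SKOS_SAMPLE)
concepts = root.findall("skos:Concept", NS)


def narrower_labels(node_id):
    """Equivalent of //skos:Concept[skos:broader/@rdf:nodeID='ID']/skos:prefLabel."""
    return [c.findtext("skos:prefLabel", namespaces=NS) for c in concepts
            if any(b.get(NODE_ID) == node_id for b in c.findall("skos:broader", NS))]


def labels_containing(text):
    """Equivalent of (//skos:prefLabel | //skos:altLabel)[contains(.,'TEXT')]."""
    return [el.text for el in root.iter()
            if el.tag in (PREF, ALT) and text in (el.text or "")]


print(narrower_labels("FTT203"))  # ['coral reefs']
print(labels_containing("reef"))  # ['coral reefs', 'reefs', 'barrier reefs', 'fringing reefs']
print(len(concepts))              # 2 for this sample; 210 for the full file
```

Iterating over the whole tree in labels_containing returns preferred and alternative labels in document order, which is also how an XPATH union orders its results.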